zuyanton opened a new issue #1837:
URL: https://github.com/apache/hudi/issues/1837


   We are running incremental upserts to our MoR table on S3, with updates every 10 minutes and inline compaction every 10 commits (roughly every 1.5 hours). We have noticed that if we want to keep history for longer than a few hours (cleaner configured to clean after 50 commits), compaction time starts increasing as the number of files in S3 grows. The chart below shows the time taken to upsert incremental changes to the table; the spikes mark the commits where inline compaction was triggered.
   
![git_compaction](https://user-images.githubusercontent.com/67354813/87568745-feb7f580-c67a-11ea-98c1-1e4598eb7cb9.PNG)
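
   For context, the cadence above corresponds to a writer configuration roughly like the following (a sketch only; the compaction and cleaner keys are standard Hudi configs matching the numbers described, but the table-type key name has varied across Hudi versions, so verify it against the version in use):

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class CompactionCadenceOptions {
       // Sketch of the writer options matching the cadence described above.
       public static Map<String, String> options() {
           Map<String, String> opts = new LinkedHashMap<>();
           // Key name for the table type differs between Hudi versions; check yours.
           opts.put("hoodie.datasource.write.table.type", "MERGE_ON_READ");
           // Inline compaction after every 10 delta commits (~1.5 h at a 10-minute cadence).
           opts.put("hoodie.compact.inline", "true");
           opts.put("hoodie.compact.inline.max.delta.commits", "10");
           // Retain 50 commits so a few hours of history stay queryable.
           opts.put("hoodie.cleaner.commits.retained", "50");
           return opts;
       }

       public static void main(String[] args) {
           options().forEach((k, v) -> System.out.println(k + "=" + v));
       }
   }
   ```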
  
   When looking into the logs, we noticed that the majority of the time is spent recursively listing all files under the table's S3 folder. More specifically, the logs contain the following lines:
   ```
   20/07/15 13:58:19 INFO HoodieMergeOnReadTableCompactor: Compacting s3://bucket/table with commit 20200715135819
   20/07/15 14:36:04 INFO HoodieMergeOnReadTableCompactor: Compaction looking for files to compact in [0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 4, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 5, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 6, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 7, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 8, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 9, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99] partitions
   ```
   The code that executes between those two log lines is:
   
https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/compact/HoodieMergeOnReadTableCompactor.java#L181-L194
  
   I put log lines around various parts of that code to measure time and was able to narrow it down to this function:
   
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java#L225
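
   The measurement was done by bracketing the suspect calls with wall-clock log lines, roughly like this (a minimal, self-contained sketch; the `timed` helper and the placeholder listing call are illustrative, not Hudi code):

   ```java
   import java.util.function.Supplier;
   import java.util.logging.Logger;

   public class TimedCall {
       private static final Logger LOG = Logger.getLogger(TimedCall.class.getName());

       // Run the supplied action and log how long it took, mirroring the ad-hoc
       // log lines added around the compactor and FSUtils calls.
       public static <T> T timed(String label, Supplier<T> action) {
           long start = System.currentTimeMillis();
           T result = action.get();
           LOG.info(label + " took " + (System.currentTimeMillis() - start) + " ms");
           return result;
       }

       public static void main(String[] args) {
           // Placeholder for the actual file-listing call being measured.
           int fileCount = timed("listing files", () -> 20_000);
           System.out.println("files=" + fileCount);
       }
   }
   ```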
  
   As a matter of fact, of a compaction that took 50+ minutes, 38 minutes were spent executing that function, which mostly consists of recursively listing files under the table's S3 location.
   We observe this issue on all tables, but it is most noticeable on tables where incremental updates touch a large number of partitions (~50% of all partitions).
   
   **Some table stats**
   100 partitions, initial size 100 GB, initial file count 6k. We observed 50+ minute compactions after the table grew to 300 GB and 20k files.
   
   
   
   **Environment Description**
   
   * Hudi version : master branch
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

