[GitHub] [hudi] codope commented on issue #5371: [SUPPORT] Hudi Compaction

GitBox Wed, 20 Apr 2022 05:44:43 -0700


codope commented on issue #5371:
URL: https://github.com/apache/hudi/issues/5371#issuecomment-1103889198


   The high-level recommendation is to use [async 
compaction](https://hudi.apache.org/docs/compaction#async-compaction) instead 
of inline compaction: if your workload is update-heavy, compacting inline adds 
to the ingestion latency. 
   
   There are three ways to trigger async compaction (details for each are in 
the link I shared):
   1. Using Spark Structured Streaming
   2. Using DeltaStreamer continuous mode
   3. Using the offline compactor utility (a separate Spark job) 
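
For the Structured Streaming route, the writer options might look roughly like the sketch below. The helper name `async_compaction_options` and the default of 5 delta commits are illustrative assumptions, not recommendations; check the config keys against the documentation for your Hudi version.

```python
def async_compaction_options(table_name: str, max_delta_commits: int = 5) -> dict:
    """Build Hudi writer options that keep compaction off the ingestion path.

    Hypothetical helper for illustration; the default threshold is an assumption.
    """
    if max_delta_commits < 1:
        raise ValueError("max_delta_commits must be at least 1")
    return {
        "hoodie.table.name": table_name,
        # Compaction only applies to merge-on-read tables.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Do not compact as part of each write.
        "hoodie.compact.inline": "false",
        # Let the streaming sink run compaction asynchronously.
        "hoodie.datasource.compaction.async.enable": "true",
        # Number of delta commits to accumulate before a compaction is scheduled.
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
    }


opts = async_compaction_options("my_table")
```

The resulting dictionary can be passed to a streaming writer, for example `df.writeStream.format("hudi").options(**opts)`, alongside the usual record key, precombine field, and checkpoint settings.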
   
   Now, to set the right configs, we need to learn more about the workload. 
Essentially, we want to pick the right compaction strategy depending on whether 
your updates touch recent partitions or whether they are spread randomly across 
all partitions. Inline compaction is more useful in cases where you have a small 
amount of late-arriving data trickling into older partitions. Also check out 
this 
[FAQ](https://hudi.apache.org/learn/faq/#how-do-i-run-compaction-for-a-mor-dataset).
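
As a rough illustration of that choice, the sketch below maps the two workload shapes to a strategy class. The mapping itself is an assumption on my part (the right strategy depends on the details of your workload), the helper name is hypothetical, and `DayBasedCompactionStrategy` expects date-based partition paths.

```python
STRATEGY_PACKAGE = "org.apache.hudi.table.action.compact.strategy"


def compaction_strategy_options(updates_hit_recent_partitions: bool,
                                recent_partitions: int = 10) -> dict:
    """Pick a compaction strategy from the shape of the update workload.

    Hypothetical helper; the mapping below is an assumption, not official guidance.
    """
    if updates_hit_recent_partitions:
        # Prioritise the latest date partitions when updates cluster there.
        return {
            "hoodie.compaction.strategy":
                f"{STRATEGY_PACKAGE}.DayBasedCompactionStrategy",
            "hoodie.compaction.daybased.target.partitions": str(recent_partitions),
        }
    # Updates spread across partitions: order file groups by accumulated log size.
    return {
        "hoodie.compaction.strategy":
            f"{STRATEGY_PACKAGE}.LogFileSizeBasedCompactionStrategy",
    }


recent = compaction_strategy_options(True, recent_partitions=7)
spread = compaction_strategy_options(False)
```

These options would be merged into the same writer (or compactor) configuration as the rest of the compaction settings.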
   
   Additionally, you could avoid creating lots of small files. See here for 
more details on small file handling: 
https://hudi.apache.org/learn/faq/#how-do-i-to-avoid-creating-tons-of-small-files
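
The small-file behaviour is mostly driven by two file-sizing configs. A minimal sketch, assuming sizes of 100 MB and 120 MB (illustrative values and a hypothetical helper name; tune them for your storage and query patterns):

```python
MB = 1024 * 1024


def small_file_options(small_file_limit_mb: int = 100,
                       max_file_size_mb: int = 120) -> dict:
    """Build Hudi file-sizing options (hypothetical helper, illustrative sizes)."""
    if small_file_limit_mb >= max_file_size_mb:
        raise ValueError("small file limit must be below the max file size")
    return {
        # Base files smaller than this are candidates to receive new inserts.
        "hoodie.parquet.small.file.limit": str(small_file_limit_mb * MB),
        # Target upper bound for a base parquet file.
        "hoodie.parquet.max.file.size": str(max_file_size_mb * MB),
    }


sizing = small_file_options()
```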
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
