codope commented on issue #5371: URL: https://github.com/apache/hudi/issues/5371#issuecomment-1103889198
The high-level recommendation is to use [async compaction](https://hudi.apache.org/docs/compaction#async-compaction) instead of inline compaction: if your workload is update-heavy, compacting inline adds to the ingestion latency. Async compaction can be triggered in three ways (details for each of them are in the link I shared):

1. Using Spark Structured Streaming
2. Using DeltaStreamer continuous mode
3. Using the offline compactor utility (a separate Spark job)

Now, to set the right configs, we need to learn more about the workload. Essentially, we want to pick the compaction strategy based on whether your updates touch recent partitions or are spread randomly across all partitions. Inline compaction is more useful in cases where you have a small amount of late-arriving data trickling into older partitions. Also check out this [FAQ](https://hudi.apache.org/learn/faq/#how-do-i-run-compaction-for-a-mor-dataset).

Additionally, you could avoid creating lots of small files. See here for more details on small file handling: https://hudi.apache.org/learn/faq/#how-do-i-to-avoid-creating-tons-of-small-files
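
As a rough starting point, the choice between async and inline compaction and between compaction strategies could be sketched like this in Python. This is only an illustration under stated assumptions: `compaction_options` and its `updates_hit_recent_partitions` flag are hypothetical names, the mapping from workload shape to strategy class is an assumption rather than an official rule, and the config keys and values should be checked against the configuration reference for your Hudi version.

```python
# Illustrative sketch only: builds a dict of Hudi writer options for a
# MERGE_ON_READ table. The function name, its parameters, and the
# workload-to-strategy mapping are assumptions made for this example.

STRATEGY_PKG = "org.apache.hudi.table.action.compact.strategy"


def compaction_options(updates_hit_recent_partitions, inline=False):
    """Return writer options that prefer async compaction unless inline=True."""
    opts = {
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Inline compaction runs as part of the write, so it adds to
        # ingestion latency; keep it off for update-heavy workloads.
        "hoodie.compact.inline": "true" if inline else "false",
        # Async compaction switch used by the Spark structured streaming sink.
        "hoodie.datasource.compaction.async.enable": "false" if inline else "true",
        # Delta commits to accumulate before a compaction is scheduled
        # (5 is used here as an example value; tune for your workload).
        "hoodie.compact.inline.max.delta.commits": "5",
    }
    if updates_hit_recent_partitions:
        # Assumption: date-partitioned table where only the latest
        # partitions need to be compacted.
        opts["hoodie.compaction.strategy"] = STRATEGY_PKG + ".DayBasedCompactionStrategy"
        opts["hoodie.compaction.daybased.target.partitions"] = "10"
    else:
        # Assumption: updates are spread across partitions, so prioritize
        # file groups by accumulated log file size.
        opts["hoodie.compaction.strategy"] = STRATEGY_PKG + ".LogFileSizeBasedCompactionStrategy"
    return opts
```

The resulting dict could then be passed to a Spark writer, for example `df.write.format("hudi").options(**compaction_options(True))`, alongside the usual table name, record key, and precombine field settings.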
