ruilch opened a new issue, #9779: URL: https://github.com/apache/hudi/issues/9779
Hello folks.

Although the [Services Documentation](https://hudi.apache.org/docs/compaction#offline-compaction) says that for large-scale loads it is beneficial to use the offline compactor for MoR tables, it turns out that you can't actually turn the compactor off completely. Moreover, the sentence `and the write parameter compaction.schedule.enable is enabled by default.` hints that if you set `compaction.schedule.enable` to `false`, compaction won't start. However, there is no such configuration as `compaction.schedule.enable`. There is only `compaction.schedule.enabled` (note the trailing `d`), which applies only to _Flink_.

So people have to guess, and the closest parameter is `hoodie.datasource.compaction.async.enable`. Digging further revealed that you can only make the compaction process either sync or async. That is, turning `hoodie.datasource.compaction.async.enable` off means the compactor runs in inline mode (i.e. it sets `hoodie.compact.inline` to `true`).

Code: https://github.com/apache/hudi/blob/35ed607693384bef6132006ff1efd8e9607d2785/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L175-L197

Line 190 turns inline compaction on or off; `off` for inline means that the async compactor will start somewhere else in the code. There is no place in the code (or at least I didn't find one) where compaction doesn't start (either inline or async), and you can't set both off, since the code above always chooses one option or the other and modifies the configuration.

**To Reproduce**

Steps to reproduce the behavior:

1. Turn `hoodie.datasource.compaction.async.enable` off.
2. See that compaction is still triggered.

**Expected behavior**

I would expect that turning `hoodie.datasource.compaction.async.enable` off would actually mean that no compaction happens, so I can delegate it to a separate offline job. I would also expect clearer documentation on how to move the compactor to an offline job.
Right now it only states that this is somehow possible, but does not explain how, which makes things confusing.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.0 EMR Serverless, FIFO scheduler
* Hive version : ?
* Hadoop version : ?
* Storage (HDFS/S3/GCS..) : AWS S3
* Running on Docker? (yes/no) : no

**Additional context**

See example screenshot with `hoodie.datasource.compaction.async.enable` off.
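
To make the either/or behaviour described in this issue easier to see, here is a minimal Python sketch of it. It is a simplified model of what the report describes, not the actual code in `DataSourceUtils.java`. The function names are hypothetical, the table-type handling is omitted, and the default of `true` for the async flag is an assumption.

```python
from typing import Dict, Mapping

ASYNC_KEY = "hoodie.datasource.compaction.async.enable"
INLINE_KEY = "hoodie.compact.inline"


def resolve_compaction_options(options: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical model: derive the inline flag from the async flag.

    Mirrors the behaviour reported in the issue: turning async off turns
    inline on, so exactly one of the two modes always ends up enabled.
    """
    resolved = dict(options)
    # Assumption: the async flag is treated as enabled when it is not set.
    async_enabled = resolved.get(ASYNC_KEY, "true").strip().lower() == "true"
    # The writer config is rewritten; any user-supplied inline value is overridden
    # in this simplified model.
    resolved[INLINE_KEY] = "false" if async_enabled else "true"
    return resolved


def compaction_mode(options: Mapping[str, str]) -> str:
    """Return "inline" or "async"; this model has no "disabled" outcome."""
    resolved = resolve_compaction_options(options)
    return "inline" if resolved[INLINE_KEY] == "true" else "async"


if __name__ == "__main__":
    for value in ("true", "false"):
        print(value, "->", compaction_mode({ASYNC_KEY: value}))
```

Under this model, no value of `hoodie.datasource.compaction.async.enable` leads to a state where compaction is off, which is the gap this issue is about.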
