ruilch opened a new issue, #9779:
URL: https://github.com/apache/hudi/issues/9779

   Hello folks.
   
   Although the [Services 
Documentation](https://hudi.apache.org/docs/compaction#offline-compaction) 
says that for large-scale loads it's beneficial to use an offline compactor for 
MoR tables, it turns out that you can't actually turn the compactor off 
completely. 
   
   Moreover, the sentence `and the write parameter compaction.schedule.enable is 
enabled by default.` hints that if you set `compaction.schedule.enable` to 
`false`, compaction won't start. However, there's no such configuration as 
`compaction.schedule.enable`. There is only `compaction.schedule.enabled` (note 
the trailing `d`), and it applies only to _Flink_. 
   
   So people have to guess, and the closest parameter is 
`hoodie.datasource.compaction.async.enable`. 
   
   But digging further revealed that you can only make the compaction process 
either sync (inline) or async. 
   
   That is, turning `hoodie.datasource.compaction.async.enable` off means that 
the compactor runs in inline mode (i.e. it sets `hoodie.compact.inline` to 
`true`). 
   
   Code: 
https://github.com/apache/hudi/blob/35ed607693384bef6132006ff1efd8e9607d2785/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L175-L197
   
   Line 190 turns inline compaction on or off; `off` for inline means that the 
async compactor will be started somewhere else in the code.
   
   There's no place in the code (or at least I didn't find one) where compaction 
is not started (either inline or async), and you can't set both off, since the 
code above always chooses one option or the other and modifies the configuration.
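
The behaviour described above can be modelled in a few lines of Python. This is a simplified sketch of what this report describes, not the actual `DataSourceUtils` code: the names `resolve_compaction_mode` and `compaction_modes` are hypothetical, a MoR table is assumed, the async flag is assumed to default to `true`, and inline compaction is assumed to be derived purely as the negation of the async flag.

```python
ASYNC_KEY = "hoodie.datasource.compaction.async.enable"
INLINE_KEY = "hoodie.compact.inline"


def resolve_compaction_mode(options):
    """Hypothetical model of the behaviour reported in this issue.

    Assumes a MoR table. Inline compaction is derived as the negation
    of the async flag, so exactly one of the two is always enabled.
    """
    resolved = dict(options)
    # Assumption: the async flag defaults to "true" when it is not set.
    async_enabled = str(options.get(ASYNC_KEY, "true")).lower() == "true"
    resolved[ASYNC_KEY] = "true" if async_enabled else "false"
    # Any user-supplied inline value is overwritten here.
    resolved[INLINE_KEY] = "false" if async_enabled else "true"
    return resolved


def compaction_modes(options):
    """Return the keys of the compaction modes that end up enabled."""
    resolved = resolve_compaction_mode(options)
    return [key for key in (ASYNC_KEY, INLINE_KEY) if resolved[key] == "true"]


# Asking for "both off" still leaves inline compaction enabled.
both_off = {ASYNC_KEY: "false", INLINE_KEY: "false"}
print(compaction_modes(both_off))
```

Under these assumptions no combination of the two flags yields an empty list, which is the problem this issue is about.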
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Turn `hoodie.datasource.compaction.async.enable` off.
   2. See that compaction is still triggered.
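
A minimal sketch of writer options for the steps above, in Python. The helper `repro_write_options`, the table name, and the commented-out PySpark call are illustrative assumptions; only the `hoodie.datasource.compaction.async.enable` setting comes from this report, and a real write also needs record key and precombine settings that are omitted here.

```python
def repro_write_options(table_name="compaction_repro"):
    """Build Hudi writer options for the reproduction (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,  # placeholder table name
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Step 1: turn async compaction off.
        "hoodie.datasource.compaction.async.enable": "false",
    }


options = repro_write_options()
# With a PySpark DataFrame `df` and a table path `base_path` (both assumed):
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```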
   
   **Expected behavior**
   
   I would expect that turning `hoodie.datasource.compaction.async.enable` off 
would actually mean that no compaction happens at all, so I can delegate it to 
a separate offline job.
   
   I would also expect clearer documentation on how to move the compactor into 
an offline job. Right now it just states that it's somehow possible, but does 
not explain how, which makes things confusing.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.4.0 EMR Serverless, FIFO scheduler
   
   * Hive version : ?
   
   * Hadoop version : ?
   
   * Storage (HDFS/S3/GCS..) : AWS S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   See example screenshot with `hoodie.datasource.compaction.async.enable` off.
   ![Screenshot 2023-09-24 at 16 00 
20](https://github.com/apache/hudi/assets/892781/3eb3cb18-b472-4cf5-95bf-92b6da7419a7)
   
   

