[GitHub] [hudi] ruilch opened a new issue, #9779: [SUPPORT] Spark MoR tables, can't turn Compaction off completely

via GitHub Sun, 24 Sep 2023 06:15:23 -0700


ruilch opened a new issue, #9779:
URL: https://github.com/apache/hudi/issues/9779

Hello folks.

Although the [Services
Documentation](https://hudi.apache.org/docs/compaction#offline-compaction) says
that for large-scale loads it's beneficial to use an offline compactor for
MoR tables, it turns out that you can't actually turn the compactor off completely.

Moreover, the sentence `and the write parameter compaction.schedule.enable is
enabled by default.` hints that if you set `compaction.schedule.enable` to
`false`, compaction won't start. However, there is no such configuration as
`compaction.schedule.enable`. There is only `compaction.schedule.enabled` (note
the trailing `d`), which applies only to _Flink_.

So people actually need to guess, and the closest parameter is
`hoodie.datasource.compaction.async.enable`.

But digging further revealed that you can only make the compaction
process either sync or async.

I.e. setting `hoodie.datasource.compaction.async.enable` off means that the
compactor runs in inline mode (i.e. it sets `hoodie.compact.inline` to
`true`).

Code:
https://github.com/apache/hudi/blob/35ed607693384bef6132006ff1efd8e9607d2785/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L175-L197

Line 190 sets inline compaction on or off; `off` for inline means that
somewhere in the code the async compactor will start.

There's no place in the code (or at least I didn't find one) where it doesn't
start compaction (either inline or async), and you can't set both off since the
code above always chooses one option or the other and modifies the configuration.
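
The behaviour described above can be modelled in a few lines. This is only a simplified sketch of my reading of the linked `DataSourceUtils` code, not the actual Hudi implementation: the function names `resolve_compaction_flags` and `compaction_mode` are hypothetical, and only the two option keys are taken from this issue.

```python
ASYNC_KEY = "hoodie.datasource.compaction.async.enable"
INLINE_KEY = "hoodie.compact.inline"


def resolve_compaction_flags(options):
    """Hypothetical model: derive the inline flag from the async flag.

    Returns a copy of `options`; the caller's dict is left untouched.
    """
    resolved = dict(options)
    async_enabled = str(options[ASYNC_KEY]).lower() == "true"
    # As described in this issue: inline compaction is switched on
    # whenever async compaction is switched off.
    resolved[INLINE_KEY] = str(not async_enabled).lower()
    return resolved


def compaction_mode(options):
    """Report which compaction mode the modelled writer ends up in."""
    resolved = resolve_compaction_flags(options)
    return "inline" if resolved[INLINE_KEY] == "true" else "async"
```

In this model there is no input for which `compaction_mode` reports that compaction is disabled, which is exactly the problem: the flag selects between two modes rather than switching compaction off.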

**To Reproduce**

Steps to reproduce the behavior:

1. Turn "hoodie.datasource.compaction.async.enable" off.
2. See that compaction is still triggering.
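
For illustration, the writer options for step 1 could look roughly like the sketch below. The helper name `repro_write_options` and the table name are placeholders, and the option set is deliberately incomplete (a real write also needs record key, precombine field, and so on).

```python
def repro_write_options(table_name):
    """Hypothetical helper: minimal Hudi options relevant to this issue."""
    return {
        "hoodie.table.name": table_name,
        # The issue concerns Merge-on-Read tables.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Step 1: turn async compaction off.
        "hoodie.datasource.compaction.async.enable": "false",
    }


# With an existing DataFrame `df` and a placeholder `base_path`, the options
# would be passed to the Spark writer along these lines:
# df.write.format("hudi").options(**repro_write_options("my_table")) \
#     .mode("append").save(base_path)
```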

**Expected behavior**

I would expect that turning `hoodie.datasource.compaction.async.enable` off
would actually mean that there won't be any compaction happening, so I can
delegate that to an offline separate job.

I would also expect clearer documentation on how to move the compactor
into an offline job. Right now it just states that it's somehow possible, but
does not explain how, which makes things confusing.

**Environment Description**

* Hudi version : 0.13.1

* Spark version : 3.4.0 EMR Serverless, FIFO scheduler

* Hive version : ?

* Hadoop version : ?

* Storage (HDFS/S3/GCS..) : AWS S3

* Running on Docker? (yes/no) : no

**Additional context**

See the example screenshot with `hoodie.datasource.compaction.async.enable` off.
![Screenshot 2023-09-24 at 16 00
20](https://github.com/apache/hudi/assets/892781/3eb3cb18-b472-4cf5-95bf-92b6da7419a7)

