[
https://issues.apache.org/jira/browse/HUDI-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan closed HUDI-3772.
-------------------------------------
Resolution: Fixed
> We automatically enable InProcessLockProvider and lazy rollbacks in spark
> datasource write if compaction configs are not set for MOR
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-3772
> URL: https://issues.apache.org/jira/browse/HUDI-3772
> Project: Apache Hudi
> Issue Type: Bug
> Components: configs, multi-writer
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Some time back, we added a fix to Hudi that automatically detects whether
> any async table services are enabled and, if no lock provider is configured,
> enables InProcessLockProvider, OCC and lazy rollbacks. This is a
> prerequisite for enabling the metadata table, which is why we put in this
> fix.
>
> This worked out well for COW and for clustering. MOR was trickier: we had
> to add an explicit check for the condition below and auto-enable on it:
> if table type = MOR and compaction is async -> enable
> InProcessLockProvider.
> This is because COW has no compaction, whereas for MOR compaction has to be
> enabled; the only question is whether it is inline or async.
>
> This all works out well if the user explicitly sets the compaction config,
> as below:
> {code:java}
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option("hoodie.compact.inline","true").
>   option("hoodie.compact.inline.max.delta.commits","2").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath)
> {code}
>
> So we clearly detect that it is inline and do not enable
> InProcessLockProvider. Auto-detection also works well with the Deltastreamer
> code path, since there we can clearly tell whether compaction is inline or
> async: for inline, Deltastreamer explicitly sets "hoodie.compact.inline" to
> "true".
>
> The tricky part is the Spark datasource: if the user skips the compaction
> config altogether, we deduce that compaction is async and go ahead and
> enable InProcessLockProvider, along with OCC and lazy rollbacks. This is a
> behavior change for a simple single writer coming from 0.10.0.
>
> {code:java}
> df2.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath)
> {code}
>
> The reason is that, as per the code, the default value for
> "hoodie.compact.inline" is "false", so we deduce that compaction is async
> whenever the user does not explicitly set it.
>
> We have to find a way to fix this.
> In a production pipeline, it is likely that every write will have compaction
> configs set; I don't see why someone would set compaction configs for some
> writes and not for others. But let's try to see if we can maintain the same
> behavior.
>
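The default-value pitfall described in the issue can be illustrated with a small standalone sketch. It is not Hudi's actual code and not the fix that was merged: the class and method names are hypothetical, and the options are modelled as a plain string map. The only facts taken from the issue are the two config keys, the "false" default for "hoodie.compact.inline", and the rule "MOR + async compaction -> enable InProcessLockProvider". The second method shows one possible direction (assumed here purely for illustration): treat compaction as async only when the user explicitly set the key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names exist in Hudi.
public class LockProviderAutoDetect {
    static final String TABLE_TYPE_KEY = "hoodie.datasource.write.table.type";
    static final String INLINE_COMPACT_KEY = "hoodie.compact.inline";
    // Default stated in the issue description.
    static final String INLINE_COMPACT_DEFAULT = "false";

    static boolean isMor(Map<String, String> opts) {
        return "MERGE_ON_READ".equals(opts.get(TABLE_TYPE_KEY));
    }

    // Behavior described in the issue: a missing key falls back to the
    // default "false", so an unset config looks the same as async compaction.
    static boolean autoEnableNaive(Map<String, String> opts) {
        boolean inline = Boolean.parseBoolean(
                opts.getOrDefault(INLINE_COMPACT_KEY, INLINE_COMPACT_DEFAULT));
        return isMor(opts) && !inline;
    }

    // Assumed alternative: only an explicitly set "false" counts as async,
    // so a writer that skips the compaction config keeps its old behavior.
    static boolean autoEnableExplicitOnly(Map<String, String> opts) {
        boolean explicitlyNotInline = opts.containsKey(INLINE_COMPACT_KEY)
                && !Boolean.parseBoolean(opts.get(INLINE_COMPACT_KEY));
        return isMor(opts) && explicitlyNotInline;
    }

    public static void main(String[] args) {
        // Mirrors the second write in the issue: MOR, no compaction config.
        Map<String, String> opts = new HashMap<>();
        opts.put(TABLE_TYPE_KEY, "MERGE_ON_READ");
        System.out.println("naive:         " + autoEnableNaive(opts));
        System.out.println("explicit-only: " + autoEnableExplicitOnly(opts));
    }
}
```

With the options from the second write above (MOR, compaction config omitted), the naive check turns the lock provider on while the explicit-only check leaves it off; with "hoodie.compact.inline" set to "true", both leave it off.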
--
This message was sent by Atlassian Jira
(v8.20.1#820001)