[ 
https://issues.apache.org/jira/browse/HUDI-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-3772.
-------------------------------------
    Resolution: Fixed

> We automatically enable InProcessLockProvider and lazy rollbacks in Spark 
> datasource write if compaction configs are not set for MOR
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3772
>                 URL: https://issues.apache.org/jira/browse/HUDI-3772
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: configs, multi-writer
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
> Some time back, we added a fix to Hudi where we automatically detect whether 
> any async table services are enabled and, if no lock provider is configured, 
> automatically enable InProcessLockProvider, OCC, and lazy rollbacks. This 
> is a prerequisite for enabling the metadata table, which is why we put in this 
> fix. 
>  
> This worked out well for COW and clustering. But MOR was tricky, and we 
> had to add an explicit check for the condition below and auto-enable it:
> if table type = MOR and compaction is async -> enable 
> InProcessLockProvider. 
> This is because COW has no compaction, whereas for MOR compaction has to be 
> enabled; it is only a question of whether it is inline or async. 
>  
> This all works out well if the user explicitly sets the compaction config as 
> below:
> {code:java}
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option("hoodie.compact.inline","true").
>   option("hoodie.compact.inline.max.delta.commits","2").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath) {code}
>  
> So we clearly detect that it is inline and do not enable InProcessLockProvider. 
> Auto-detection also works well with the Deltastreamer code path, since we can 
> clearly detect whether compaction is inline or async. For inline, 
> Deltastreamer will explicitly set "hoodie.compact.inline" to "true".
>  
> But the tricky part is that, with the Spark datasource, if the user skips the 
> compaction config altogether, we auto-detect that it is async and go ahead and 
> enable InProcessLockProvider, along with OCC and lazy rollbacks. So this 
> is a behavior change for a simple single writer coming from 0.10.0. 
>  
> {code:java}
> df2.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option("hoodie.datasource.write.table.type","MERGE_ON_READ").
>   option(TABLE_NAME, tableName).
>   mode(Append).
>   save(basePath) {code}
>  
> The reason is that, as per the code, the default value for 
> "hoodie.compact.inline" is "false". So we deduce that compaction is async if 
> the user does not explicitly set it.
>  
> We have to find a way to fix this. 
> In a production pipeline, it is likely that every write will have compaction 
> configs set; I don't see why someone would set compaction configs for a few 
> writes and not for others. But let's try to see if we can maintain the same 
> behavior. 
>  
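
The deduction described in the issue can be illustrated with a minimal Java sketch. This is not Hudi's actual code: the class and method names are hypothetical, and only the two config keys are taken from the description above. The first check mimics the behavior reported in the issue, where a missing "hoodie.compact.inline" falls back to the default "false" and the write is treated as async. The second check is one possible alternative, stated purely as an assumption and not necessarily the fix that was committed: it treats compaction as async only when the user set the key explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these class or method names exist in Hudi.
class CompactionModeSketch {
    static final String TABLE_TYPE_KEY = "hoodie.datasource.write.table.type";
    static final String INLINE_COMPACT_KEY = "hoodie.compact.inline";

    static boolean isMor(Map<String, String> opts) {
        return "MERGE_ON_READ".equals(opts.get(TABLE_TYPE_KEY));
    }

    // Behavior described in the issue: a missing key falls back to the
    // default "false", so the write is treated as using async compaction.
    static boolean asyncByDefault(Map<String, String> opts) {
        boolean inline = Boolean.parseBoolean(
            opts.getOrDefault(INLINE_COMPACT_KEY, "false"));
        return isMor(opts) && !inline;
    }

    // Hypothetical alternative (an assumption, not the committed fix):
    // treat compaction as async only if the user explicitly set the key to false.
    static boolean asyncOnlyIfExplicit(Map<String, String> opts) {
        return isMor(opts)
            && opts.containsKey(INLINE_COMPACT_KEY)
            && !Boolean.parseBoolean(opts.get(INLINE_COMPACT_KEY));
    }

    public static void main(String[] args) {
        // Mirrors the second write in the description: MOR, no compaction config.
        Map<String, String> opts = new HashMap<>();
        opts.put(TABLE_TYPE_KEY, "MERGE_ON_READ");
        System.out.println("default-based check says async: " + asyncByDefault(opts));
        System.out.println("explicit-only check says async: " + asyncOnlyIfExplicit(opts));
    }
}
```

With the options from the second write (MOR, no compaction config), the two checks disagree, which is the behavior change a single writer coming from 0.10.0 would see.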



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
