[ 
https://issues.apache.org/jira/browse/HUDI-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3772:
--------------------------------------
    Description: 
Some time back, we added a fix to Hudi wherein we automatically detect whether 
any async table services are enabled and, if no lock provider is configured, 
automatically enable InProcessLockProvider, OCC, and lazy rollbacks. This is a 
prerequisite for enabling the metadata table, which is why we put in this fix. 

 

This worked out well for COW and for clustering. But MOR was tricky, and we 
had to add an explicit check for the condition below and auto-enable on it:

if table type = MOR and compaction is async -> enable InProcessLockProvider. 

This is because COW has no compaction, whereas MOR always has compaction 
enabled; the only question is whether it is inline or async. 
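
The rule above can be sketched roughly as follows. This is only an 
illustration of the condition, not Hudi's actual code: the class 
LockAutoEnableSketch, the method shouldAutoEnableInProcessLock, and its 
parameters are hypothetical names chosen for this example.
{code:java}
public class LockAutoEnableSketch {

    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    // Hypothetical helper: true when the writer would fall back to
    // InProcessLockProvider because an async table service may run
    // alongside it and the user configured no lock provider.
    static boolean shouldAutoEnableInProcessLock(TableType tableType,
                                                 boolean inlineCompaction,
                                                 boolean asyncClustering,
                                                 boolean lockProviderConfigured) {
        if (lockProviderConfigured) {
            return false; // never override an explicit user choice
        }
        // COW has no compaction. For MOR, compaction always runs, so
        // "not inline" is read as "async".
        boolean asyncCompaction =
            tableType == TableType.MERGE_ON_READ && !inlineCompaction;
        return asyncCompaction || asyncClustering;
    }

    public static void main(String[] args) {
        // MOR with inline compaction: no auto-enable
        System.out.println(shouldAutoEnableInProcessLock(
            TableType.MERGE_ON_READ, true, false, false));
        // MOR without inline compaction: treated as async, auto-enable
        System.out.println(shouldAutoEnableInProcessLock(
            TableType.MERGE_ON_READ, false, false, false));
        // COW without clustering: no auto-enable
        System.out.println(shouldAutoEnableInProcessLock(
            TableType.COPY_ON_WRITE, false, false, false));
    }
}
{code}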

 

This all works out well if the user explicitly sets the compaction config, as below:
{code:java}
// Scala, as pasted from spark-shell; assumes the Hudi quickstart imports
// and that df, tableName and basePath are already defined.
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "2").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
{code}
 

So we clearly detect that it is inline and do not enable InProcessLockProvider. 

Auto-detection also works well with the Deltastreamer code path, since there we 
can clearly detect whether compaction is inline or async: for inline, 
Deltastreamer explicitly sets "hoodie.compact.inline" to "true".

 

But the tricky part is the Spark datasource: if the user skips the compaction 
config altogether, we auto-detect that compaction is async and go ahead and 
enable InProcessLockProvider, along with OCC and lazy rollbacks. So this is a 
behavior change for a simple single writer coming from 0.10.0. 

 
{code:java}
// Scala, as pasted from spark-shell; same assumptions as the snippet above.
// No compaction config is set here.
df2.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
{code}
 

The reason is that, as per the code, the default value for 
"hoodie.compact.inline" is "false", and so we deduce that compaction is async 
if the user does not explicitly set it.
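
To make the deduction concrete, here is a small sketch using plain 
java.util.Properties rather than Hudi's config classes. The class 
CompactionModeSketch and both methods are hypothetical. looksAsync mirrors the 
behavior described above. explicitlyAsync shows one possible way to tell an 
omitted key apart from an explicit "false". It is an assumption for 
illustration, not the fix that was chosen for this issue.
{code:java}
import java.util.Properties;

public class CompactionModeSketch {

    static final String INLINE_COMPACT_KEY = "hoodie.compact.inline";

    // Mirrors the deduction described above: a missing key falls back to
    // the default "false", so compaction is concluded to be async.
    static boolean looksAsync(Properties props) {
        return !Boolean.parseBoolean(
            props.getProperty(INLINE_COMPACT_KEY, "false"));
    }

    // Possible refinement (assumption): treat compaction as async only
    // when the user explicitly set the key to false.
    static boolean explicitlyAsync(Properties props) {
        String value = props.getProperty(INLINE_COMPACT_KEY);
        return value != null && !Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Properties omitted = new Properties(); // user set no compaction config
        Properties inline = new Properties();
        inline.setProperty(INLINE_COMPACT_KEY, "true");

        System.out.println(looksAsync(omitted));      // true
        System.out.println(explicitlyAsync(omitted)); // false
        System.out.println(looksAsync(inline));       // false
    }
}
{code}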

 

We have to find a way to fix this. 

Maybe in a production pipeline it is likely that every write will have 
compaction configs set; I don't see why someone would set compaction configs 
for a few writes and not for others. But let's try to see if we can maintain 
the same behavior as in 0.10.0. 


> We automatically enable InProcessLockProvider and lazy rollbacks in spark 
> datasource write if compaction configs are not set for MOR
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3772
>                 URL: https://issues.apache.org/jira/browse/HUDI-3772
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: configs, multi-writer
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.11.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
