[ 
https://issues.apache.org/jira/browse/HUDI-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6478:
---------------------------------
    Labels: pull-request-available  (was: )

> Simplify INSERT_INTO configs
> ----------------------------
>
>                 Key: HUDI-6478
>                 URL: https://issues.apache.org/jira/browse/HUDI-6478
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark-sql
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>
> We have three or four different configs in the mix for the INSERT_INTO 
> command; let's try to simplify them:
>  
> hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable and 
> datasource.operation.type.
>  
> Rough notes:
>  
> hoodie.sql.bulk.insert.enable: true | false.
>  
> hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
> STRICT: we can't re-ingest the same record again; an exception is thrown if 
> duplicates are found in the incoming batch.
> NON_STRICT: no such constraint, but it has to be set along with bulk_insert 
> (if bulk_insert is enabled); otherwise an exception is thrown.
> UPSERT: the default insert mode (until a week back, when we switched to make 
> bulk_insert the default for INSERT_INTO). Takes care of de-duplication and 
> uses OverwriteWithLatestAvroPayload (which means an existing record can be 
> updated across batches).
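>  
> For reference, today's behavior can be sketched in Spark SQL roughly as 
> follows (a sketch only; the table name "hudi_tbl" is made up, and exact 
> session syntax may vary by Spark version):
>  
> ```sql
> -- current-style configs that this ticket proposes to simplify
> SET hoodie.sql.insert.mode = strict;        -- fail if the batch re-ingests existing records
> SET hoodie.sql.bulk.insert.enable = false;  -- bulk_insert is not combined with strict mode
> INSERT INTO hudi_tbl VALUES (1, 'a');
> ```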
>  
> datasource.operation.type: insert, bulk_insert, upsert
>  
> drop.dups: drop a new incoming record if it already exists in the table.
>  
> Proposal:
>  
>  * We will introduce a new config named "hoodie.sql.write.operation" with 
> three valid values ("insert", "bulk_insert" and "upsert"). The default value 
> will be "insert" for INSERT_INTO.
>  ** Deprecate hoodie.sql.insert.mode and "hoodie.sql.bulk.insert.enable".
>  * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the 
> operation type is "insert", for both spark-sql and spark-ds. This will retain 
> duplicates but still help with small-file management for "insert"s.
>  * Introduce a new config named "hoodie.datasource.insert.dedupe.policy" 
> whose valid values are "ignore", "fail" and "drop", with "ignore" as the 
> default. "fail" will mimic the "STRICT" mode we support as of now; even 
> spark-ds users can then use the fail/STRICT behavior if need be.
>  ** Deprecate hoodie.datasource.insert.drop.dups.
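>  
> Under this proposal, a session would look roughly like the sketch below 
> (the config names "hoodie.sql.write.operation" and 
> "hoodie.datasource.insert.dedupe.policy" are the names proposed in this 
> ticket, not yet released; the table names are made up):
>  
> ```sql
> -- proposed simplified configs
> SET hoodie.sql.write.operation = insert;            -- insert | bulk_insert | upsert
> SET hoodie.datasource.insert.dedupe.policy = drop;  -- ignore | fail | drop
> INSERT INTO hudi_tbl SELECT * FROM staging_tbl;
> ```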



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
