[
https://issues.apache.org/jira/browse/HUDI-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-6478:
---------------------------------
Labels: pull-request-available (was: )
> Simplify INSERT_INTO configs
> ----------------------------
>
> Key: HUDI-6478
> URL: https://issues.apache.org/jira/browse/HUDI-6478
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> We have two to three different configs in the mix for the INSERT_INTO
> command; let's try to simplify them:
>
> hoodie.sql.insert.mode, drop dups, hoodie.sql.bulk.insert.enable, and
> datasource.operation.type.
>
> Rough notes:
>
> hoodie.sql.bulk.insert.enable: true | false.
>
> hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
> STRICT: we can't re-ingest the same record again; an exception is thrown if
> duplicates are found among the records to be ingested.
> NON_STRICT: no such constraint, but this mode has to be set along with
> bulk_insert (if bulk_insert is enabled); if not, an exception is thrown.
> UPSERT: the default insert mode (until a week back, when we switched to make
> bulk_insert the default for INSERT_INTO). It takes care of de-duplication and
> uses OverwriteWithLatestAvroPayload (which means we can update an existing
> record across batches).
>
> datasource.operation.type: insert, bulk_insert, upsert
>
> drop.dups: Drop new incoming records if they already exist in the table.
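>
> A minimal sketch of how the existing configs combine in a Spark SQL session
> (table names here are hypothetical, and exact accepted values may vary by
> Hudi version):
>
> ```sql
> -- pick an insert mode; STRICT fails on incoming duplicates
> set hoodie.sql.insert.mode = non_strict;
> -- optionally route INSERT INTO through bulk_insert
> set hoodie.sql.bulk.insert.enable = true;
>
> insert into hudi_tbl select * from staging_tbl;
> ```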
>
> Proposal:
>
> * We will introduce a new config named "hoodie.sql.write.operation" which
> will have 3 values ("insert", "bulk_insert" and "upsert"). The default value
> will be "insert" for INSERT_INTO.
> ** Deprecate "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
> * Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true if the
> operation type is "insert", for both spark-sql and spark-ds. This will retain
> duplicates but still help with small-file management for "insert"s.
> * Introduce a new config named "hoodie.datasource.insert.dedupe.policy"
> whose valid values are "ignore", "fail" and "drop". Make "ignore" the
> default. "fail" will mimic the "STRICT" mode we support as of now; even
> spark-ds users can use the fail/STRICT behavior if need be.
> ** Deprecate "hoodie.datasource.insert.drop.dups".
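>
> Under this proposal, the same session would need only the two new configs (a
> sketch of the proposed behavior, not merged code; table names hypothetical):
>
> ```sql
> -- one config picks the write operation used by INSERT INTO
> set hoodie.sql.write.operation = insert;
> -- dedup policy: ignore (default) | drop | fail (mimics today's STRICT mode)
> set hoodie.datasource.insert.dedupe.policy = fail;
>
> insert into hudi_tbl select * from staging_tbl;
> ```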
--
This message was sent by Atlassian Jira
(v8.20.10#820010)