sivabalan narayanan created HUDI-6478:
-----------------------------------------
Summary: Simplify INSERT_INTO configs
Key: HUDI-6478
URL: https://issues.apache.org/jira/browse/HUDI-6478
Project: Apache Hudi
Issue Type: Improvement
Components: spark-sql
Reporter: sivabalan narayanan
We have a few different configs in the mix for the INSERT_INTO command. Let's try to
simplify them: hoodie.sql.insert.mode, the drop-duplicates config,
hoodie.sql.bulk.insert.enable, and the datasource operation type.
Rough notes:
hoodie.sql.bulk.insert.enable: true | false
hoodie.sql.insert.mode: STRICT | NON_STRICT | UPSERT
* STRICT: the same record cannot be re-ingested; an exception is thrown if incoming
records duplicate ones already in the table.
* NON_STRICT: no such constraint, but this mode has to be set along with bulk insert
(if bulk insert is enabled); otherwise an exception is thrown.
* UPSERT: the default insert mode (until a week back, when we switched to make
bulk_insert the default for INSERT_INTO). Takes care of de-duplication and uses
OverwriteWithLatestAvroPayload (which means an existing record can be updated
across batches).
datasource.operation.type: insert | bulk_insert | upsert
drop.dups: drop new incoming records if they already exist in the table.
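For reference, the current interaction between these configs in spark-sql can be sketched as follows; the table and column names are made up purely for illustration:

```sql
-- Hypothetical Hudi table used only for illustration.
CREATE TABLE hudi_orders (id INT, name STRING, price DOUBLE, ts BIGINT)
USING hudi
TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts');

-- Route INSERT INTO through bulk_insert. Bulk insert does no de-duplication,
-- so it must be paired with NON_STRICT mode or an exception is thrown.
SET hoodie.sql.bulk.insert.enable = true;
SET hoodie.sql.insert.mode = non_strict;
INSERT INTO hudi_orders VALUES (1, 'a1', 10.0, 1000);

-- Under STRICT mode, re-ingesting id = 1 throws instead of deduplicating.
SET hoodie.sql.bulk.insert.enable = false;
SET hoodie.sql.insert.mode = strict;
INSERT INTO hudi_orders VALUES (1, 'a1', 11.0, 1001);  -- fails: duplicate key
```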
Proposal:
* We will introduce a new config named "hoodie.sql.write.operation" with three
valid values ("insert", "bulk_insert", and "upsert"). The default value will be
"insert" for INSERT_INTO.
** Deprecate "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
* Also, enable "hoodie.merge.allow.duplicate.on.inserts" = true when the operation
type is "insert", for both spark-sql and spark-ds. This will retain duplicates
while still helping with small-file management for "insert"s.
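Under this proposal, the spark-sql flow could look roughly as below. Note that "hoodie.sql.write.operation" is the proposed config (it does not exist yet), and the table is hypothetical:

```sql
-- Assuming a Hudi table hudi_orders with primaryKey = 'id'.
-- Proposed: a single config selects the write operation for INSERT INTO.
SET hoodie.sql.write.operation = insert;

-- With "insert", hoodie.merge.allow.duplicate.on.inserts = true would be
-- implied, so duplicates are retained while small files still get packed.
INSERT INTO hudi_orders VALUES (1, 'a1', 10.0, 1000);
INSERT INTO hudi_orders VALUES (1, 'a1', 11.0, 1001);  -- kept as a duplicate
```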
* Introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose
valid values are "ignore", "fail", and "drop", with "ignore" as the default.
"fail" will mimic the "STRICT" mode we support as of now; even spark-ds users can
use the fail/STRICT behavior if need be.
** Deprecate "hoodie.datasource.insert.drop.dups".
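A rough sketch of the proposed dedupe policy in action ("hoodie.datasource.insert.dedupe.policy" is also proposed, not an existing config; the table is hypothetical):

```sql
-- Assuming a Hudi table hudi_orders with primaryKey = 'id' that already
-- contains a record with id = 1.

-- "fail" mimics today's STRICT mode: duplicates abort the write.
SET hoodie.datasource.insert.dedupe.policy = fail;
INSERT INTO hudi_orders VALUES (1, 'a1', 12.0, 1002);  -- throws: id 1 exists

-- "drop" silently discards incoming duplicates, like drop.dups today.
SET hoodie.datasource.insert.dedupe.policy = drop;
INSERT INTO hudi_orders VALUES (1, 'a1', 12.0, 1002);  -- incoming row dropped

-- "ignore" (the default) writes the record as-is, keeping duplicates.
SET hoodie.datasource.insert.dedupe.policy = ignore;
```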
--
This message was sent by Atlassian Jira
(v8.20.10#820010)