nsivabalan opened a new pull request, #9123:
URL: https://github.com/apache/hudi/pull/9123
### Change Logs
With the intent of simplifying the various config options that govern INSERT_INTO
in spark-sql, we are doing an overhaul. Today there are 3 to 4 interacting configs
for INSERT_INTO, covering the operation type, insert mode, dropping duplicates,
and enabling bulk insert. Here is what the simplification brings in:
- Introduce a new config, "hoodie.sql.write.operation", with three valid values
("insert", "bulk_insert" and "upsert"). The default value for INSERT_INTO will be
"insert" (see the sketch after this list).
- Deprecate "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
- Also, set "hoodie.merge.allow.duplicate.on.inserts" = true when the operation
type is "insert", for both spark-sql and spark-ds. This retains duplicates but
still helps with small file management for inserts.
- Introduce a new config, "hoodie.datasource.insert.dedupe.policy", with valid
values "ignore", "fail" and "drop"; "ignore" is the default. "fail" mimics the
"STRICT" mode we support as of now.
- Deprecate "hoodie.datasource.insert.drop.dups".
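As a rough sketch only (the table names are placeholders, and the config names
assume the proposal above lands as described), an INSERT_INTO under the new
configs might look like:
```sql
-- Hypothetical spark-sql session; hudi_tbl and staging_tbl are placeholder names.
-- Pick the physical write operation used by INSERT INTO (default: insert).
SET hoodie.sql.write.operation = bulk_insert;        -- insert | bulk_insert | upsert

-- Decide what happens when the incoming batch duplicates existing records
-- (default: ignore; fail mimics today's STRICT mode).
SET hoodie.datasource.insert.dedupe.policy = drop;   -- ignore | fail | drop

INSERT INTO hudi_tbl SELECT * FROM staging_tbl;
```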
Precedence between the old and new configs (where "old" refers to
"hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation") is as
follows; see the sketch below this list:
- When both old and new configs are set, the new config takes effect.
- When only the new config is set, the new config takes effect.
- When neither is set, the new config's default takes effect.
- When only the old config is set, the old config takes effect.

Please do note that we are deprecating these old configs and will remove them
completely in two releases, so we recommend that users migrate to the new
configs.
Behavior change:
With this patch, we are also switching the default behavior of INSERT_INTO to
use "insert" as the operation underneath. Until 0.13.1, the default behavior was
"upsert": if you ingested the same batch of records in commit1 and again in
commit2, hudi would do an upsert and a snapshot read would return only the
latest values. With this patch, the default becomes "insert", as the name
(INSERT_INTO) signifies, so ingesting the same batch of records in commit1 and
in commit2 will result in duplicate records on snapshot read. If users override
the respective configs, we will honor them; only the default path, where none of
the respective configs are overridden explicitly, sees the behavior change.
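To make the change concrete, here is an end-to-end sketch (the table schema and
values are invented purely for illustration):
```sql
-- Hypothetical table; primaryKey follows Hudi's spark-sql create-table syntax.
CREATE TABLE t (id INT, name STRING, price DOUBLE) USING hudi
TBLPROPERTIES (primaryKey = 'id');

INSERT INTO t VALUES (1, 'a', 10.0);   -- commit1
INSERT INTO t VALUES (1, 'a', 10.0);   -- commit2, same batch again

-- Until 0.13.1 (upsert by default): returns 1 row.
-- With this patch (insert by default): returns 2 rows (duplicates retained).
SELECT * FROM t;
```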
### Impact
Usability will be improved for spark-sql users, as we have deprecated a few
confusing configs and aligned spark-sql behavior with spark datasource writes.
This patch also brings the behavior change described above: the default
operation for INSERT_INTO switches from "upsert" (the default until 0.13.1) to
"insert", so ingesting the same batch of records in commit1 and in commit2 will
result in duplicate records on snapshot read. Explicit user overrides of the
respective configs are honored; only the untouched-default path sees the change.
### Risk level (write none, low, medium or high below)
medium
### Documentation Update
We will have to call out the behavior change in our release docs and also update
our quick start guide accordingly.
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed