nsivabalan opened a new pull request, #9123:
URL: https://github.com/apache/hudi/pull/9123
### Change Logs
With the intent of simplifying the various config options that govern INSERT_INTO
in spark-sql, we are doing an overhaul. Today there are 3 to 4 interacting configs
for INSERT_INTO, covering the operation type, insert mode, dropping duplicates,
and enabling bulk insert. Here is what the simplification brings in:
- Introduce a new config, "hoodie.sql.write.operation", with three valid values
("insert", "bulk_insert" and "upsert"). The default value for INSERT_INTO will be
"insert" (see the sketch after this list).
- Deprecate "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
- Also, set "hoodie.merge.allow.duplicate.on.inserts" = true when the operation
type is "insert", for both spark-sql and spark-ds. This retains duplicates but
still helps with small file management for inserts.
- Introduce a new config, "hoodie.datasource.insert.dedupe.policy", with valid
values "ignore", "fail" and "drop"; "ignore" is the default. "fail" mimics the
"STRICT" mode we support as of now.
- Deprecate "hoodie.datasource.insert.drop.dups".
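As a rough sketch only (the table names are placeholders, and the config names
assume the proposal above lands as described), an INSERT_INTO under the new
configs might look like:
```sql
-- Hypothetical spark-sql session; hudi_tbl and staging_tbl are placeholder names.
-- Pick the physical write operation used by INSERT INTO (default: insert).
SET hoodie.sql.write.operation = bulk_insert;        -- insert | bulk_insert | upsert

-- Decide what happens when the incoming batch duplicates existing records
-- (default: ignore; fail mimics today's STRICT mode).
SET hoodie.datasource.insert.dedupe.policy = drop;   -- ignore | fail | drop

INSERT INTO hudi_tbl SELECT * FROM staging_tbl;
```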
Precedence between the old and new configs (where "old" refers to
"hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation") is as
follows; see the sketch below this list:
- When both old and new configs are set, the new config takes effect.
- When only the new config is set, the new config takes effect.
- When neither is set, the new config's default takes effect.
- When only the old config is set, the old config takes effect.

Please do note that we are deprecating these old configs and will remove them
completely in two releases, so we recommend that users migrate to the new
configs.
Behavior change:
With this patch, we are also switching the default behavior of INSERT_INTO to
use "insert" as the operation underneath. Until 0.13.1, the default behavior was
"upsert": if you ingested the same batch of records in commit1 and again in
commit2, hudi would do an upsert and a snapshot read would return only the
latest values. With this patch, the default becomes "insert", as the name
(INSERT_INTO) signifies, so ingesting the same batch of records in commit1 and
in commit2 will result in duplicate records on snapshot read. If users override
the respective configs, we will honor them; only the default path, where none of
the respective configs are overridden explicitly, sees the behavior change.
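To make the change concrete, here is an end-to-end sketch (the table schema and
values are invented purely for illustration):
```sql
-- Hypothetical table; primaryKey follows Hudi's spark-sql create-table syntax.
CREATE TABLE t (id INT, name STRING, price DOUBLE) USING hudi
TBLPROPERTIES (primaryKey = 'id');

INSERT INTO t VALUES (1, 'a', 10.0);   -- commit1
INSERT INTO t VALUES (1, 'a', 10.0);   -- commit2, same batch again

-- Until 0.13.1 (upsert by default): returns 1 row.
-- With this patch (insert by default): returns 2 rows (duplicates retained).
SELECT * FROM t;
```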
### Impact
Usability will be improved for spark-sql users, as we have deprecated a few
confusing configs and aligned spark-sql behavior with spark datasource writes.
This patch also brings the behavior change described above: the default
operation for INSERT_INTO switches from "upsert" (the default until 0.13.1) to
"insert", so ingesting the same batch of records in commit1 and in commit2 will
result in duplicate records on snapshot read. Explicit user overrides of the
respective configs are honored; only the untouched-default path sees the change.
### Risk level (write none, low, medium or high below)
medium
### Documentation Update
We will have to call out the behavior change in our release docs and also update
our quick start guide accordingly.
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed