nsivabalan opened a new pull request, #7146:
URL: https://github.com/apache/hudi/pull/7146
### Change Logs
Users sometimes want incoming records sorted on specific columns during
insert/upsert. Currently, sorting is supported only with bulk_insert. This
patch adds sorting support for the insert and upsert operations as well.
Typical use case:
The classic mismatch between event time and query predicates. In the case of
Uber's trip data, the dataset is partitioned on datestr, but most queries
filter on city_id. So, instead of relying on clustering to sort after the
fact, this patch adds support for sorting the records at ingestion time.
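To illustrate why sort order matters for a city_id predicate, here is a toy sketch in plain Python. It is not Hudi code: it assumes a query engine skips files based on per-file min/max column statistics, and every name in it (`split_into_files`, `files_to_scan`, the file size, the sample data) is hypothetical.

```python
import random

def split_into_files(records, file_size):
    """Chunk records into fixed-size 'files', preserving their order."""
    return [records[i:i + file_size] for i in range(0, len(records), file_size)]

def files_to_scan(files, city_id):
    """Count files whose [min, max] city_id range could contain city_id."""
    count = 0
    for f in files:
        ids = [r["city_id"] for r in f]
        if min(ids) <= city_id <= max(ids):
            count += 1
    return count

random.seed(0)
# One datestr partition holding trips from 100 cities, arriving in event order.
records = [
    {"datestr": "2022-11-01", "city_id": random.randint(1, 100)}
    for _ in range(10_000)
]

# Written as they arrive: every file tends to span almost all city_ids.
unsorted_files = split_into_files(records, 1_000)
# Sorted on city_id before writing: each file covers a narrow city_id range.
sorted_files = split_into_files(sorted(records, key=lambda r: r["city_id"]), 1_000)

unsorted_scan = files_to_scan(unsorted_files, 42)
sorted_scan = files_to_scan(sorted_files, 42)
print(f"files to scan for city_id=42: unsorted={unsorted_scan}, sorted={sorted_scan}")
```

In this sketch the sorted layout lets a city_id filter touch far fewer files within the same datestr partition, which is the effect the patch aims for at write time rather than through later clustering.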
### Impact
Users will now be able to optionally sort records based on columns of their
choice while ingesting records with insert or upsert.
Configs of interest:
- hoodie.write.sort.mode: possible values are NONE, GLOBAL_SORT and
PARTITIONER_SORT.
- hoodie.write.sort.cols: comma-separated list of columns to sort on.
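A minimal sketch of how these configs might be assembled for a write, assuming they are accepted like any other Hudi write option. The two sort keys and their values are taken from this PR description and may differ in what is eventually merged. The helper `sorted_write_options` and the table name `trips` are hypothetical.

```python
def sorted_write_options(table_name, operation, sort_mode, sort_cols):
    """Build a dict of Hudi write options including the sort configs from this PR."""
    allowed_ops = {"insert", "upsert"}
    allowed_modes = {"NONE", "GLOBAL_SORT", "PARTITIONER_SORT"}
    if operation not in allowed_ops:
        raise ValueError(f"operation must be one of {sorted(allowed_ops)}")
    if sort_mode not in allowed_modes:
        raise ValueError(f"sort_mode must be one of {sorted(allowed_modes)}")
    if sort_mode != "NONE" and not sort_cols:
        raise ValueError("sort_cols must be non-empty when sorting is enabled")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        # Config names below are as proposed in this PR (assumption).
        "hoodie.write.sort.mode": sort_mode,
        "hoodie.write.sort.cols": ",".join(sort_cols),
    }

opts = sorted_write_options("trips", "upsert", "GLOBAL_SORT", ["city_id"])

# With a Spark DataFrame `df` and a target path `base_path`, the options
# would typically be passed along these lines:
# df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Validating the mode up front is a choice made for this sketch only; the PR description does not say how invalid values are handled.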
### Risk level (write none, low, medium or high below)
Medium
### Documentation Update
_Describe any necessary documentation update if there is any new feature,
config, or user-facing change_
- _The config description must be updated if new configs are added or the
default values of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website.
Please create a Jira ticket, attach the ticket number here and follow the
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to
make changes to the website._
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed