nsivabalan opened a new pull request, #8107:
URL: https://github.com/apache/hudi/pull/8107
### Change Logs
This PR introduces support for automatic record key generation.
### Background
At present, ingesting data into Hudi has a few unavoidable prerequisites, one
of which is specifying the record key configuration (with the record key serving
as the primary key). The necessity to specify a primary key is one of the core
assumptions built into the Hudi model, centered around being able to update the
target table efficiently.
However, some payloads don't have a naturally present record key: for example,
when ingesting some kind of "logs" into Hudi, there might be no unique
identifier held in every record that could serve as the record key while
meeting the global uniqueness requirements of a primary key. Nevertheless, we
want to make sure that Hudi is able to support such payloads while still
providing Hudi's core strengths.
#### Requirements
1. Auto-generated record keys have to provide for global uniqueness within the
table, not just within the batch.
This is necessary to make sure we're able to support updating such tables.
2. Keys should be generated in a way that allows for their efficient
compression.
This is necessary to make sure that auto-generated keys do not bring
substantial overhead (in storage and in handling).
3. Auto-generation of record keys should be robust against partial failures
and retries, such as task and stage failures and retries. In other words, such
events should not result in data duplication or data loss.
#### Implementation
To support payloads with no naturally present record key, we propose enabling
a new mode of operation for Hudi in which a synthetic, globally unique (within
the table) record key is injected upon persisting the dataset as a Hudi table.
To achieve our goal of providing globally unique keys, we plan to rely on a
synthetic key format comprised of the following three components:
- (Reserved) commit timestamp: the reserved commit timestamp, used as a prefix
(to provide for global uniqueness of rows across commits)
- Task partition id: the engine's parallelizable unit of execution (the Spark
partition id in case of the Spark engine)
- Row id: a unique identifier of the row (record) within the given task
partition
These components are combined into a single string key as follows:
```"${commit_timestamp}_${partition_id}_${row_id}"```
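As a minimal illustration of the key composition (the class and method names below are hypothetical and not part of Hudi's API), the format amounts to a simple concatenation of the three components:

```java
// Hypothetical sketch of composing the proposed synthetic record key.
// The class and method names are illustrative, not actual Hudi code.
public class AutoRecordKey {
    // Produces "${commit_timestamp}_${partition_id}_${row_id}"
    public static String generate(String commitTimestamp, int partitionId, long rowId) {
        return commitTimestamp + "_" + partitionId + "_" + rowId;
    }
}
```

The commit timestamp prefix is what makes keys from different write batches disjoint; the partition id and row id together disambiguate rows within a single commit.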
For row-id generation, we plan to use a generator very similar in spirit to
Spark's `monotonically_increasing_id()` expression to produce a unique identity
value for every row within the batch (this could easily be implemented for any
parallel execution framework such as Flink).
Please note that this setup is very similar to how `_hoodie_commit_seqno` is
currently produced.
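To sketch the idea (this class is illustrative, not actual Hudi code): Spark's `monotonically_increasing_id()` achieves global uniqueness by packing the partition id into the upper 31 bits of a 64-bit long, but here the partition id is already a separate component of the key string, so a plain per-partition counter is enough to make the row id unique within its task partition:

```java
// Illustrative per-partition row-id generator, similar in spirit to
// Spark's monotonically_increasing_id(). Not actual Hudi code: since
// the partition id is a separate key component, a simple counter
// suffices for uniqueness within the task partition.
public class RowIdGenerator {
    private long nextRowId = 0L; // increments once per row in this partition

    // Returns a row id that is unique within this task partition.
    public long next() {
        return nextRowId++;
    }
}
```

Because each task partition generates its row ids independently and the partition id disambiguates across partitions, no coordination between tasks is needed, which keeps key generation embarrassingly parallel.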
- Added support to all four different write paths in Spark (Spark datasource,
Spark streaming, Spark SQL, and DeltaStreamer).
- Added/fixed all built-in key generators to support auto-generation of
record keys.
- Added guard rails around some configs when used alongside auto-generation
of record keys (e.g., de-duplication within the same batch is not allowed, the
"upsert" operation type is not allowed, etc.).
### Impact
- Eases the usability of Hudi for users looking to ingest immutable datasets.
### Risk level (write none, low medium or high below)
Low.
### Documentation Update
Will update the configurations page accordingly. We may also need to update
our website to call out this feature separately, perhaps under the
KeyGenerators page.
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]