nsivabalan opened a new pull request, #8107:
URL: https://github.com/apache/hudi/pull/8107

   ### Change Logs
   
   This PR introduces the capability for Record Key Auto Generation.
   
   ### Background
   
   At present, ingesting data into Hudi has a few unavoidable prerequisites, one of which is specifying the record key configuration (with the record key serving as the primary key). The necessity of specifying a primary key is one of the core assumptions built into the Hudi model, which is centered around being able to update the target table efficiently.
   
   However, some payloads don't actually have a naturally present record key: for example, when ingesting some kind of "logs" into Hudi, there might be no unique identifier held in every record that could serve as the record key while meeting the global uniqueness requirement of a primary key. Nevertheless, we want to make sure that Hudi is able to support such payloads while still preserving Hudi's core strengths.
   
   #### Requirements
   1. Auto-generated record keys have to provide for global uniqueness within the table, not just within the batch. 
   This is necessary to ensure we are able to support updating such tables.
   2. Keys should be generated in a way that allows for their efficient compression.
   This is necessary to ensure that auto-generated keys do not bring substantial overhead (in storage and in handling).
   3. Auto-generation of record keys should be robust against partial failures and retries, such as task and stage failures and retries. In other words, such events should not result in data duplication or data loss. 
   
   #### Implementation
   To support payloads with no naturally present record key, we propose to enable a new mode of operation for Hudi in which a synthetic, globally unique (within the table) record key is injected upon persistence of the dataset as a Hudi table.
   
   To achieve our goal of providing globally unique keys, we plan to rely on the following synthetic key format, comprised of 3 components:
   
    - (Reserved) Commit timestamp: use the reserved commit timestamp as a prefix (to provide for global uniqueness of rows across commits)
    - Partition id: the engine's task partition id, or the parallelizable unit for the engine of interest (Spark partition id in the case of the Spark engine)
    - Row id: a unique identifier of the row (record) within the provided task partition. 
   
   These are combined into a single string key as below:
   
   ```"${commit_timestamp}_${partition_id}_${row_id}"```
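   As a minimal illustration (the class and method names below are hypothetical, chosen for this sketch, and are not Hudi APIs), the three components could be combined like this:

   ```java
   // Illustrative sketch only: buildAutoGeneratedKey is a hypothetical helper,
   // not part of Hudi. It shows how the three components form the key string.
   public class AutoKeySketch {
       // commitTimestamp: the reserved commit timestamp (e.g. "20230306123045000")
       // partitionId: the engine's task partition id
       // rowId: the row's unique offset within that task partition
       static String buildAutoGeneratedKey(String commitTimestamp, int partitionId, long rowId) {
           return String.format("%s_%d_%d", commitTimestamp, partitionId, rowId);
       }

       public static void main(String[] args) {
           // Prints "20230306123045000_7_42"
           System.out.println(buildAutoGeneratedKey("20230306123045000", 7, 42));
       }
   }
   ```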
   
   For row-id generation, we plan to use a generator very similar in spirit to Spark's `monotonically_increasing_id()` expression to produce a unique identity value for every row within the batch (this could easily be implemented for any parallel execution framework, such as Flink).
   
   Please note that this setup is very similar to how `_hoodie_commit_seqno` is currently produced.
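   A sketch of such a row-id generator, under the assumption that it mirrors Spark's `monotonically_increasing_id()` encoding (partition id in the upper bits, a per-partition counter in the lower 33 bits); this is illustrative only and not the actual Hudi implementation:

   ```java
   // Hypothetical row-id generator in the spirit of Spark's
   // monotonically_increasing_id(): one instance per task partition.
   public class RowIdGenerator {
       private final long base;   // partition id shifted into the upper bits
       private long offset = 0L;  // monotonically increasing per-partition counter

       public RowIdGenerator(int partitionId) {
           // Lower 33 bits are reserved for the per-partition counter,
           // so distinct partitions occupy disjoint id ranges.
           this.base = ((long) partitionId) << 33;
       }

       // Returns a row id unique within the batch: the shifted partition id
       // separates partitions, and the counter separates rows within one.
       public long next() {
           return base + offset++;
       }
   }
   ```

   Because ids are dense within each partition and share a common prefix per partition, they also compress well, addressing requirement 2.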
   
   - Added support for all four write paths in Spark (Spark datasource, Spark streaming, Spark SQL, and Deltastreamer). 
   - Added/fixed all built-in key generators to support auto-generation of record keys. 
   - Added guard rails around some of the configs when used alongside auto-generation of record keys (e.g., de-dup within the same batch is not allowed, the "upsert" operation type is not allowed, etc.). 
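   The guard rails above could be sketched as follows. Note that every config key and the method name here are hypothetical, chosen purely for illustration; they are not Hudi's actual option names:

   ```java
   import java.util.Map;

   // Hypothetical validation of write options when record key auto-generation
   // is in effect (assumed here to mean: no record key field was configured).
   public class AutoKeyGenGuardRails {
       static void validate(Map<String, String> opts) {
           // Assumption for this sketch: absence of a record key field config
           // implies auto-generation is enabled.
           boolean autoKeyGen = !opts.containsKey("recordkey.field");
           if (!autoKeyGen) {
               return;
           }
           // "upsert" requires matching records by key, which auto-generated
           // keys cannot support, so it is rejected.
           if ("upsert".equalsIgnoreCase(opts.get("operation"))) {
               throw new IllegalArgumentException(
                   "\"upsert\" is not allowed with record key auto-generation");
           }
           // De-duplication within a batch compares record keys, which do not
           // exist yet at that point, so it is rejected as well.
           if (Boolean.parseBoolean(opts.getOrDefault("combine.before.insert", "false"))) {
               throw new IllegalArgumentException(
                   "de-dup within the same batch is not allowed with record key auto-generation");
           }
       }
   }
   ```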
   
   ### Impact
   
   - Eases the usability of Hudi for users who are looking to ingest immutable datasets. 
   
   ### Risk level (write none, low medium or high below)
   
   Low. 
   
   ### Documentation Update
   
   Will be updating the configurations page for this. We might also need to update our website to call out this feature separately, perhaps under the KeyGenerators page. 
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

