[GitHub] [hudi] mzheng-plaid commented on issue #7829: [SUPPORT] Using monotonically_increasing_id to generate record key causing duplicates on upsert

via GitHub Wed, 17 May 2023 18:56:03 -0700


mzheng-plaid commented on issue #7829:
URL: https://github.com/apache/hudi/issues/7829#issuecomment-1552300370


   @nsivabalan hmm from the description in #8107:
   
   > Engine's task partitionId or parallelizable unit for the engine of interest. (Spark PartitionId incase of spark engine)
   > Row id: unique identifier of the row (record) w/in the provided task partition.
   > Combining them in a single string key as below
   > 
   > "${commit_timestamp}_${partition_id}_${row_id}"
   > 
   > For row-id generation we're planning on using generator very similar in spirit to `monotonically_increasing_id()` expression from Spark to generate unique identity value for every row w/in batch (could be easily implemented for any parallel execution framework like Flink, etc)
   
   How does this avoid the same problem in this ticket?
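
To make the question concrete, here is a minimal sketch of the key format quoted above, written in plain Python rather than against Spark or Hudi. The function name `generate_keys`, the list-of-lists stand-in for task partitions, and the sample commit timestamp are all assumptions made for illustration; this is not the implementation proposed in #8107.

```python
from typing import Iterable, List


def generate_keys(commit_timestamp: str, partitions: List[Iterable[object]]) -> List[str]:
    """Build one key per row as "<commit_timestamp>_<partition_id>_<row_id>".

    `partitions` is a hypothetical stand-in for the engine's task partitions.
    """
    keys = []
    for partition_id, rows in enumerate(partitions):
        # row_id is only unique within its own partition, so the
        # partition_id is needed to make the key unique within the batch.
        for row_id, _row in enumerate(rows):
            keys.append(f"{commit_timestamp}_{partition_id}_{row_id}")
    return keys


# Hypothetical commit timestamp and a batch split into two partitions.
keys = generate_keys("20230517185603", [["a", "b"], ["c"]])

# The same rows split differently, e.g. if the batch were recomputed.
keys_recomputed = generate_keys("20230517185603", [["a"], ["b", "c"]])
```

In this sketch the keys are unique within a single pass, but the key a row receives depends on which partition it lands in and on its position there: row `"b"` gets `..._0_1` in the first call and `..._1_0` in the second. Whether the approach in #8107 guards against that kind of recomputation is what the question above is asking.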
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

