[GitHub] [hudi] nsivabalan commented on issue #7829: [SUPPORT] Using monotonically_increasing_id to generate record key causing duplicates on upsert

via GitHub Fri, 03 Mar 2023 18:06:07 -0800


nsivabalan commented on issue #7829:
URL: https://github.com/apache/hudi/issues/7829#issuecomment-1454335343


   I might know why this is happening. If you can clarify one thing, we can 
confirm.
   
   For a given df, when the primary key is generated with 
monotonically_increasing_id, calling the key generation twice could return 
different keys, right? Spark only guarantees that the IDs are unique, not that 
each row gets the same ID on every evaluation.
   
   This matters because, down the line, our upsert partitioner is based on the 
hash of the record key. So if the Spark DAG is re-triggered for one of the 
Spark partitions, the re-attempt at primary key generation could produce a new 
set of keys whose hash values differ from the first attempt, and you might see 
duplicates or data loss.
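
   To make the concern concrete, here is a minimal pure-Python sketch (not Spark 
or Hudi code). It assumes the documented layout of monotonically_increasing_id: 
the partition index in the upper bits and the row position within the partition 
in the lower 33 bits. `assign_ids` and `bucket` are hypothetical helpers; 
`bucket` is only a stand-in for a hash-based partitioner, not Hudi's actual 
upsert partitioner. The "retry" input assumes the same rows arrive in a 
different order when the stage is recomputed, which can happen after a shuffle.

```python
def assign_ids(partitions):
    """Simplified model of monotonically_increasing_id():
    id = (partition index << 33) + row position within the partition."""
    ids = {}
    for part_idx, rows in enumerate(partitions):
        for pos, row in enumerate(rows):
            ids[row] = (part_idx << 33) + pos
    return ids


def bucket(key, num_buckets=4):
    """Hypothetical stand-in for a partitioner keyed on the record key hash."""
    return key % num_buckets


# First attempt: two Spark partitions holding two rows each.
first_attempt = [["a", "b"], ["c", "d"]]
# Re-attempt (assumed): same rows, but a different order inside each partition.
retry = [["b", "a"], ["d", "c"]]

ids_first = assign_ids(first_attempt)
ids_retry = assign_ids(retry)

# The generated IDs are unique in both attempts, and even the same set of values,
# but the row-to-ID mapping is not stable across attempts.
changed = sorted(r for r in ids_first if ids_first[r] != ids_retry[r])

for row in sorted(ids_first):
    print(row, ids_first[row], bucket(ids_first[row]),
          ids_retry[row], bucket(ids_retry[row]))
```

   In this sketch every row ends up with a different key on the re-attempt, and 
row "a" also lands in a different bucket, so a downstream step that recomputes 
the keys would route it differently than the first attempt did.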
   
   


