nsivabalan commented on issue #7829: URL: https://github.com/apache/hudi/issues/7829#issuecomment-1454335343
I might know why this could be happening. If you can clarify something, we can confirm. For a given df, if the primary key is generated with the monotonically increasing function and key generation is invoked twice, it could return different keys, right? Spark only ensures that the keys are unique, not that they are the same across invocations. This matters because, down the line, our upsert partitioner is based on the hash of the record key. So if the Spark DAG is re-triggered for one of the Spark partitions, the re-attempt of primary key generation could produce a new set of keys whose hash values differ from the first attempt, and you might see duplicates or data loss.
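To make the concern concrete, here is a minimal pure-Python sketch. It is a simulation, not Spark or Hudi code: `simulated_monotonic_ids` only mimics the documented layout of Spark's `monotonically_increasing_id` (partition index in the upper bits, row position within the partition in the lower 33 bits), and `bucket_for` is a hypothetical stand-in for a hash-based partitioner, not Hudi's actual upsert partitioner. The assumption being illustrated is that a retried task may see the same rows in a different order.

```python
import zlib


def simulated_monotonic_ids(partitions):
    """Assign an id to each row from its partition index and its position
    within that partition (simulation of the documented id layout)."""
    ids = {}
    for partition_index, rows in enumerate(partitions):
        for position, row in enumerate(rows):
            ids[row] = (partition_index << 33) + position
    return ids


def bucket_for(key, num_buckets=4):
    """Hypothetical hash partitioner: deterministic hash of the key, modulo
    the number of buckets."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# First attempt: rows "a" and "b" arrive in this order in partition 0.
first = simulated_monotonic_ids([["a", "b"], ["c", "d"]])

# Re-attempt (assumed scenario): the same rows arrive in a different order.
retry = simulated_monotonic_ids([["b", "a"], ["c", "d"]])

# Keys are unique within each attempt, but some rows received a different key.
changed = sorted(row for row in first if first[row] != retry[row])

# Bucket each row would be routed to in each attempt.
routing = {row: (bucket_for(first[row]), bucket_for(retry[row])) for row in first}
```

If a row's key changes between attempts, a hash-based partitioner can route it to a different bucket on the retry, which is the kind of situation where duplicates or missing records could show up.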
