nsivabalan commented on PR #7640: URL: https://github.com/apache/hudi/pull/7640#issuecomment-1382232963
hey @kazdy: we also jammed quite a bit before arriving at this solution. For example, we took a stab at generating unique IDs for every record [here](https://github.com/apache/hudi/pull/7622), but as Tim pointed out, that approach may not work for 7622.

If we zoom into what happens during a commit in Hudi, the flow is: keyGen -> index lookup -> upsert partitioner -> write files by executor (merge handle, create handle, or append handle) -> possibly write to the metadata table -> complete the commit. The crux is that the upsert partitioner assigns records to insert buckets based on a hash of the record key. Say the upsert partitioner decides to add 3 new insert buckets and split 30k records among those 3 buckets (file groups); that assignment is done by hashing the record key. Given this, if the keyGen stage is re-triggered for a subset of Spark partitions due to failures, a record could be assigned to a different insert bucket when it reaches the upsert partitioner than on its first attempt. So there is a chance we miss some records, or pack more records into one file group than intended.

Let me know if this makes sense. Happy to jam on whether we can really pull this off with some sort of row-ID generation rather than one based on the record payload.
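To make the failure mode concrete, here is a minimal sketch (not Hudi code; `assign_bucket` is a hypothetical stand-in for the upsert partitioner's hash-based bucket choice) showing why non-deterministic keys break bucket assignment across stage retries:

```python
import hashlib
import uuid

def assign_bucket(record_key: str, num_buckets: int) -> int:
    # Hypothetical stand-in for the upsert partitioner: pick an insert
    # bucket (file group) by hashing the record key modulo bucket count.
    h = int(hashlib.md5(record_key.encode("utf-8")).hexdigest(), 16)
    return h % num_buckets

NUM_BUCKETS = 3

# Attempt 1: keyGen runs and generates keys (here, random UUIDs to model
# a non-deterministic key generator).
rows = ["row-a", "row-b", "row-c"]
attempt1_keys = {row: str(uuid.uuid4()) for row in rows}
attempt1_buckets = {row: assign_bucket(k, NUM_BUCKETS)
                    for row, k in attempt1_keys.items()}

# A Spark task failure re-triggers keyGen for the same rows. Because the
# keys are regenerated (new UUIDs), the hash-based assignment can route
# the same row to a different bucket than on the first attempt.
attempt2_keys = {row: str(uuid.uuid4()) for row in rows}
attempt2_buckets = {row: assign_bucket(k, NUM_BUCKETS)
                    for row, k in attempt2_keys.items()}

# By contrast, a key derived deterministically from the record payload
# always hashes to the same bucket, no matter how many times the stage
# is retried.
stable_bucket_1 = assign_bucket("payload-derived-key", NUM_BUCKETS)
stable_bucket_2 = assign_bucket("payload-derived-key", NUM_BUCKETS)
```

The point of the sketch: with payload-derived keys the retry is idempotent (`stable_bucket_1 == stable_bucket_2`), while with per-attempt generated IDs the two attempts can disagree, which is how records go missing from, or get over-packed into, a file group.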
