nsivabalan commented on PR #7640:
URL: https://github.com/apache/hudi/pull/7640#issuecomment-1382232963

   hey @kazdy : we also jammed quite a bit before arriving at this solution. For example, we did take a stab at generating unique ids for every record [here](https://github.com/apache/hudi/pull/7622), but as Tim stated, the problem might not be solved by 7622. To see why, let's zoom into what happens during a commit in hudi:
   keyGen -> index look up -> upsert partitioner -> write files by executor (merge handle or create handle or append handle) -> maybe write to metadata table -> complete commit.
   
   The main crux here is that, in the upsert partitioner, we assign records to different insert buckets based on a hash of the record key. Let's say the upsert partitioner determined it should add 3 new insert buckets and split 30k records among those 3 insert buckets (file groups). This assignment is done by hashing the record key.
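   To make the idea concrete, here is a minimal sketch of hash-based bucket assignment. This is a hypothetical illustration, not Hudi's actual `UpsertPartitioner` logic; the class and method names are made up for the example.

   ```java
   import java.util.List;

   public class BucketAssignmentSketch {
       // Deterministic: the same record key always maps to the same bucket,
       // because the bucket is a pure function of the key.
       static int assignBucket(String recordKey, int numBuckets) {
           return Math.abs(recordKey.hashCode() % numBuckets);
       }

       public static void main(String[] args) {
           List<String> keys = List.of("key-001", "key-002", "key-003");
           for (String k : keys) {
               System.out.println(k + " -> bucket " + assignBucket(k, 3));
           }
       }
   }
   ```

   The key property to notice: as long as the record key itself is stable, the assignment is stable too.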
   
   Given this, if the keyGen stage gets retriggered for a subset of spark partitions due to failures, then by the time those records reach the upsert partitioner, a record could get assigned to a different insert bucket than in its 1st attempt. So there are chances we will miss some records, or pack more records into one file group than we intended.
   
   Let me know if this makes sense. Happy to jam to see if we can really pull this off by generating a row-id style key rather than deriving the key from the record payload.
   

