nsivabalan commented on issue #3418:
URL: https://github.com/apache/hudi/issues/3418#issuecomment-894507674


   You can still dedup with bulk_insert via the config 
"hoodie.combine.before.insert", but note that it only dedupes among the incoming 
records. If you wish to update a record that already exists in Hudi, then yes, 
upsert is the way to go. But it looks like you are doing this as a one-time 
migration, so bulk_insert should work out. Once the initial migration is 
complete, you can start doing "upsert"s. 
   
   There is some overloaded terminology around partitioned and non-partitioned 
datasets. Let me try to explain. 
   
   Partitioned dataset: A pair of record key and partition path forms a unique 
record in hudi.
   But here, there could be two types of indices: regular and global. 
   With a regular index (BLOOM/SIMPLE), duplicate records can exist across 
partition paths: 
   if your incoming record is (rec1, pp1, ...), only data within pp1 will be 
searched for updates; matches are updated and the rest are routed as inserts. 
   Whereas with a global index, if an incoming record is (rec1, pp1, ...), 
all partitions will be searched for updates and updated accordingly. 
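   The regular-vs-global lookup difference can be sketched like this (pure Python, not Hudi's index code; the table contents and record names are hypothetical):

```python
# Contrast regular vs global index lookup semantics.
# Existing table state: record rec1 currently lives in partition pp0.
existing = {("pp0", "rec1"): {"val": "old"}}

def tag_location(key, partition, global_index):
    """Return the (partition, key) location to update, or None -> insert."""
    if global_index:
        # Global index: search every partition for the record key.
        for (pp, k) in existing:
            if k == key:
                return (pp, k)  # update the record where it already lives
        return None
    # Regular index (BLOOM/SIMPLE): only look inside the incoming partition.
    return (partition, key) if (partition, key) in existing else None

# Incoming record (rec1, pp1, ...):
regular_hit = tag_location("rec1", "pp1", global_index=False)  # None -> insert
global_hit = tag_location("rec1", "pp1", global_index=True)    # found in pp0
```

   So with a regular index the incoming (rec1, pp1) becomes a second copy of rec1, while a global index updates the existing copy in pp0.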
   
   And then there is the non-partitioned dataset, where all records go into a 
single folder (NonpartitionedKeyGenerator). Guess this is self-explanatory: it 
is synonymous with the partitioned dataset above with just 1 partition. 
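   For reference, a non-partitioned table is typically configured by pointing the key generator at `NonpartitionedKeyGenerator` and setting no partition path field. The table name and record key field below are made up for illustration:

```python
# Illustrative Spark datasource options for a non-partitioned Hudi table.
# "my_table" and "rec_key" are hypothetical; the key generator class is Hudi's.
hudi_options = {
    "hoodie.table.name": "my_table",                       # hypothetical name
    "hoodie.datasource.write.recordkey.field": "rec_key",  # hypothetical field
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    # no partitionpath field: every record lands in a single folder
}
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```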
   
   

