gtwuser commented on issue #6869: URL: https://github.com/apache/hudi/issues/6869#issuecomment-1273179895
> Hi,
>
> The example given states that you end up with two different record keys, since delivery is part of the key. In this case abcd.delivery and abcd.payment make up the record key: `'hoodie.datasource.write.recordkey.field': 'abcd.delivery,abcd.payment'`, so the 1st record key is `2000,upi` and the 2nd record key is `3000,upi`.
>
> The record key works like a PK on a table (a unique, non-nullable field). In your case, you end up with two records with different record keys, and that's expected. Precombine and upserts are supposed to maintain the uniqueness of the record key.
>
> To make it easier, assume you use only the delivery field as the record key. If you have a record with delivery:3000, Hudi will do an insert (if a record with the same record key does not exist in the table); if the record has delivery:2000 and it already exists in the table, Hudi will do an update.
>
> Precombine works before the write: the incoming batch of data is deduplicated based on the record key and the precombine field, so if the incoming batch has two records with the same record key, the one with the greater precombine field value is passed to the write operation.

@kazdy I am sorry, I feel my description was not clear, so let me rephrase it. The scenario is that the same record is pushed to the Hudi table at two different times, and I then find two records in the table instead of one. By "same record" I mean the same record key, i.e. `delivery : 2000` in record1 and again `delivery : 2000` in record2, with the two records sent in two different files at different times.

```bash
recordKey : `delivery` # just one attribute
Glue job 1 reading file1 has delivery:`2000`
Glue job 2 reading file2 also has delivery:`2000`
# this should lead to an upsert as we have the same record key, but the observation is otherwise
```

After running both Glue jobs there should ideally be only one record (the updated one) in the table, but we find two records.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
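For reference, the behaviour described in this comment (deduplicate the incoming batch by record key and precombine field, then insert or update by record key) can be modelled with a minimal pure-Python sketch. This is not Hudi code: the `precombine` and `upsert` helpers, the `ts` precombine field, and the `card` payment value are hypothetical names chosen for illustration, and the model assumes the incoming record simply overwrites the stored one and ignores partitioning and index types.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]
Key = Tuple[object, ...]


def record_key(rec: Record, key_fields: List[str]) -> Key:
    """Build the (possibly composite) record key from the configured fields."""
    return tuple(rec[f] for f in key_fields)


def precombine(batch: List[Record], key_fields: List[str], precombine_field: str) -> List[Record]:
    """Deduplicate one incoming batch: per record key, keep the record
    with the greatest precombine value (later record wins on ties)."""
    latest: Dict[Key, Record] = {}
    for rec in batch:
        key = record_key(rec, key_fields)
        kept = latest.get(key)
        if kept is None or rec[precombine_field] >= kept[precombine_field]:
            latest[key] = rec
    return list(latest.values())


def upsert(table: Dict[Key, Record], batch: List[Record],
           key_fields: List[str], precombine_field: str) -> Dict[Key, Record]:
    """Insert records whose key is new; overwrite records whose key exists.
    Assumption: the incoming record always replaces the stored one."""
    for rec in precombine(batch, key_fields, precombine_field):
        table[record_key(rec, key_fields)] = rec
    return table


# Two separate "jobs", each writing one file with the same record key.
table: Dict[Key, Record] = {}
upsert(table, [{"delivery": 2000, "payment": "upi", "ts": 1}], ["delivery"], "ts")
upsert(table, [{"delivery": 2000, "payment": "card", "ts": 2}], ["delivery"], "ts")
print(len(table))  # one record expected under this model

# With a composite key (delivery, payment), different delivery values
# produce different keys, so both records are kept.
composite: Dict[Key, Record] = {}
upsert(composite, [{"delivery": 2000, "payment": "upi", "ts": 1}], ["delivery", "payment"], "ts")
upsert(composite, [{"delivery": 3000, "payment": "upi", "ts": 2}], ["delivery", "payment"], "ts")
print(len(composite))
```

Under this model the second job updates the existing `delivery : 2000` record rather than adding a new one, which is the outcome the comment expects from an upsert on a single-field record key.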
