gtwuser commented on issue #6869:
URL: https://github.com/apache/hudi/issues/6869#issuecomment-1273179895

   > Hi,
   > 
   > In the example given, you end up with two different record keys, since 
delivery is part of the key. In this case abcd.delivery and abcd.payment 
together make up the record key: 'hoodie.datasource.write.recordkey.field': 
'abcd.delivery,abcd.payment', so the 1st record key is 2000,upi and the 2nd 
record key is 3000,upi.
   > 
   > The record key works like a PK on a table (unique, non-nullable field). In 
your case, you end up with two records with different record keys, and that's 
expected. Precombine and upserts are supposed to maintain the uniqueness of 
recordKey.
   > 
   > To make it easier, assume you use only the delivery field as the record 
key. If you have a record with delivery:3000, Hudi will do an insert (if no 
record with the same record key exists in the table); if a record with 
delivery:2000 already exists in the table, Hudi will do an update.
   > 
   > Precombine works before the write: the incoming batch of data is 
deduplicated based on record key and precombine field, so if the incoming batch 
has two records with the same record key, the one with the greater precombine 
field value is passed to the write operation.
   
   @kazdy I am sorry, I feel my description was not clear, so let me rephrase 
it. The scenario is: when the same record is pushed to the Hudi table at two 
different times, I find two records in the table instead of one. By same record 
I mean the same record key, i.e. `delivery : 2000` in record1 and again 
`delivery : 2000` in record2, with the two records sent in two different files 
at different times. 
   ```bash
   recordKey : `delivery` # just one attribute
   Glue job 1 reading file1 has delivery:`2000` 
   Glue job 2 reading file2 also has delivery:`2000`  # this should lead to upsert as we have same record key, but the observation is otherwise
   ```
   After running both Glue jobs there should ideally be only one record (the 
updated one) in the table, but we find two records. 
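
To make the expectation concrete, here is a minimal plain-Python sketch of the behaviour described above. It is not Hudi code: the `precombine` and `upsert` helpers, the `ts` precombine field, and the `card` payment value are assumptions made up for illustration, and the unconditional overwrite on a key match is a simplification.

```python
def precombine(batch, key_field, precombine_field):
    """Deduplicate one incoming batch: per record key, keep the record
    with the greatest precombine value (later record wins on ties)."""
    latest = {}
    for rec in batch:
        key = rec[key_field]
        if key not in latest or rec[precombine_field] >= latest[key][precombine_field]:
            latest[key] = rec
    return list(latest.values())


def upsert(table, batch, key_field, precombine_field):
    """Insert records whose key is new; overwrite records whose key exists.
    `table` is a dict keyed by record key, standing in for the Hudi table."""
    for rec in precombine(batch, key_field, precombine_field):
        table[rec[key_field]] = rec
    return table


table = {}

# Glue job 1 reading file1 (hypothetical `ts` value)
upsert(table, [{"delivery": 2000, "payment": "upi", "ts": 1}], "delivery", "ts")

# Glue job 2 reading file2, same record key, arriving later
upsert(table, [{"delivery": 2000, "payment": "card", "ts": 2}], "delivery", "ts")

print(len(table))              # 1
print(table[2000]["payment"])  # card
```

Under these assumptions the second job updates the existing record, so the table ends with one record for `delivery : 2000`. That is the outcome I expect from the two Glue jobs, and it is not what I observe.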
   


