dmenin opened a new issue #3394:
URL: https://github.com/apache/hudi/issues/3394


   Hello everyone.
   
   I have a quick question about Hudi's default behavior.
   I want to understand how UPSERT works for the same key in different
   scenarios.
   I am using the 'GLOBAL_SIMPLE' index, which, as I understand it, tries to
   enforce uniqueness across all partitions.
   
   The scenario is really straightforward: based on a timestamp, I want new
   data to be upserted and old data to be ignored.
   The data on disk (S3) is partitioned by year\month\day, so there are
   basically 4 scenarios:
   
   1) Inserting NEW data on the same partition
   2) Inserting NEW data on a different partition
   3) Inserting OLD data on the same partition
   4) Inserting OLD data on a different partition
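
The four scenarios above can be told apart with a small helper. This is only a sketch to make the test matrix explicit: the function names are hypothetical, and the `year/month/day` path format is an assumption about how the partition is derived from the timestamp.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def partition_of(ts: str) -> str:
    """Derive a year/month/day partition path from a timestamp string (assumed layout)."""
    return datetime.strptime(ts, TS_FORMAT).strftime("%Y/%m/%d")


def classify(existing_ts: str, incoming_ts: str) -> int:
    """Return the scenario number (1-4) for an incoming row that has the same key."""
    newer = datetime.strptime(incoming_ts, TS_FORMAT) > datetime.strptime(existing_ts, TS_FORMAT)
    same_partition = partition_of(existing_ts) == partition_of(incoming_ts)
    if newer:
        return 1 if same_partition else 2
    return 3 if same_partition else 4
```

For example, `classify("2021-06-24 10:01:00", "2021-06-23 09:00:00")` falls into scenario 4: older data that belongs to a different partition.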
   
   
    
   Below are the results of the test on these scenarios.
   There is only one row with 4 columns:
   two keys (composite, always 100, 100)
   one description
   one timestamp (it determines the partition and it's the sort key)
   
   
   Under "DB" you see the row that was in the database (the current state of
   the database);
   under "Row In" you see the row that was read from the file and passed to
   the insert statement; and
   under "Result" you see the state of the database after the insert.
   
   There are no headers, but the first two numbers (100 and 100) are the
   composite key, the string is the text, and the datetime is the date of the
   row, which is converted to an integer (epoch) and used as the value for both
   "hoodie.datasource.write.precombine.field" and "hoodie.payload.ordering.field".
    
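
To make the setup described above concrete, here is a rough sketch of the epoch conversion and the write options involved. It is not the attached code: the table name and the column names (`key1`, `key2`, `ts_epoch`, `year`, `month`, `day`) are hypothetical, the timestamps are assumed to be UTC, and the dictionary is not a complete configuration (a composite key usually also needs a matching key generator setting, which is left out here).

```python
from datetime import datetime, timezone


def to_epoch(ts: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to epoch seconds."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


# Hypothetical write options; only the option keys are real Hudi settings.
hudi_options = {
    "hoodie.table.name": "hudi_sample",                                # hypothetical name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "key1,key2",            # hypothetical columns
    "hoodie.datasource.write.partitionpath.field": "year,month,day",   # hypothetical columns
    "hoodie.datasource.write.precombine.field": "ts_epoch",            # epoch column (hypothetical name)
    "hoodie.payload.ordering.field": "ts_epoch",                       # same column as precombine
    "hoodie.index.type": "GLOBAL_SIMPLE",
}
```

In a Spark job, a dictionary like this would typically be passed with `df.write.format("hudi").options(**hudi_options)`, with the epoch column computed before the write.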
    
   As you can see below, cases 1 and 2, which deal with NEWER data, update the
   row with the new data. This is expected.
   
   Case 3 does not apply the "older data": the record in the DB was from 10AM
   and the new record was from 8AM. This is great; it is also the behavior I
   want.
    
   But in case 4, if I try to upsert older data that belongs to an OLDER
   partition, it updates the row. This is weird; I would expect cases 3 and 4
   to behave the same.
   
   Why does the partition of the data determine whether the data is updated or
   not?
   Why did scenario 4 DELETE the data from partition 24 and insert it into 23? I
   mean, it's great that Hudi only kept one copy of each key, but why do
   scenarios 3 and 4 behave differently?
   This is all running in AWS Glue with Hudi 0.7.
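
The behavior I expected can be written down as a tiny model. This is only a sketch of the expectation (the row with the later timestamp wins, regardless of partition), not of how Hudi implements upserts; the function name and the tuple layout are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def expected_upsert(db_row, row_in):
    """Return the row expected to survive an upsert for the same key.

    Rows are (key1, key2, description, timestamp) tuples. The partition is
    deliberately ignored: only the timestamp decides. Ties keep the existing
    row, which is an arbitrary choice for this sketch.
    """
    incoming = datetime.strptime(row_in[3], TS_FORMAT)
    existing = datetime.strptime(db_row[3], TS_FORMAT)
    return row_in if incoming > existing else db_row


# Case 4: older data that belongs to an older partition.
db = (100, 100, "dif partition", "2021-06-24 10:01:00")
row = (100, 100, "old data dif partition", "2021-06-23 09:00:00")
survivor = expected_upsert(db, row)  # expected: the existing DB row is kept
```

Under this model cases 3 and 4 both keep the existing row, whereas the actual result for case 4 was the incoming row.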
    
   CASE 1 - Inserting NEW data on the same partition
   DB:
   100  100  three  2021-06-23 10:00:00
   Row In:
   100  100  same partition  2021-06-23 10:01:00
   Result (OK):
   100  100  same partition  2021-06-23 10:01:00
    
    
    
   CASE 2 - Inserting NEW data on different partition
   DB:
   100  100  2021-06-23 10:01:00  same partition
   Row In:
   100  100  2021-06-24 10:01:00  dif partition
   Result (OK):
   100  100  2021-06-24 10:01:00  dif partition
    
    
    
   CASE 3 - Inserting OLD data on the same partition
   DB:
   100  100  2021-06-24 10:01:00  dif partition
   Row In:
   100  100  2021-06-24 08:00:00  old data same partition
   Result (OK):
   100  100  2021-06-24 10:01:00  dif partition
    
    
    
   CASE 4 - Inserting OLD data on different partition
   DB:
   100  100  2021-06-24 10:01:00  dif partition
   Row In:
   100  100  2021-06-23 09:00:00  old data dif partition
   Result (BAD):
   100  100  2021-06-23 09:00:00  old data dif partition
    
   
   I am attaching the code that I am using.
   Any help would be greatly appreciated.
   Thanks very much.
   
   [hudisample.txt](https://github.com/apache/hudi/files/6924753/hudisample.txt)
   



