dmenin opened a new issue #3394:
URL: https://github.com/apache/hudi/issues/3394
Hello everyone.
I have a quick question about hudi’s default behavior.
I want to understand how UPSERT works for the same key in different
scenarios.
I am using the ‘GLOBAL_SIMPLE’ index, which, from my understanding, tries to
enforce key uniqueness across all partitions.
The scenario is really straightforward: based on a timestamp, I want new
data to be upserted and old data to be ignored.
The data on disk (S3) is partitioned by year\month\day, so there are basically
4 scenarios:
1) Inserting NEW data on the same partition
2) Inserting NEW data on different partition
3) Inserting OLD data on the same partition
4) Inserting OLD data on different partition
Below is the result of the test on these scenarios.
It is only one row with 4 columns:
two keys (composite, always 100, 100)
one description
one timestamp (it becomes the partition and it's the sort key)
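To make the role of the timestamp concrete, here is a minimal sketch of how one datetime could yield both the epoch sort key and the year/month/day partition. The helper names `to_epoch` and `to_partition`, the UTC interpretation, and the `/` separator in the partition path are assumptions for illustration; they are not taken from the attached code.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def to_epoch(ts: str) -> int:
    """Convert a row timestamp to integer epoch seconds (assumes UTC)."""
    parsed = datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def to_partition(ts: str) -> str:
    """Derive a year/month/day partition path from the same timestamp."""
    return datetime.strptime(ts, TS_FORMAT).strftime("%Y/%m/%d")

# One row: composite key (100, 100), a description, and the timestamp.
ts = "2021-06-23 10:00:00"
row = {
    "key1": 100,
    "key2": 100,
    "description": "three",
    "ts_epoch": to_epoch(ts),            # sort key
    "partition_path": to_partition(ts),  # partition
}
```

Because the sort key and the partition both come from the same column, an older row can also land in an older partition, which is exactly the situation in scenario 4.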
Under “DB” you see the row that was in the database (the current state of
the database);
under “Row In” you see the row that was read from the file and passed to
the insert statement; and
under “Result” you see the state of the database after the insert.
There are no headers, but the first two numbers (100 and 100) are the
composite key, the string is the text, and the datetime is the date of the row,
which is converted to an integer (epoch) and used as the value for both
"hoodie.datasource.write.precombine.field" and "hoodie.payload.ordering.field".
As you can see below, cases 1 and 2, which deal with NEWER data, update the
row with the new data. This is expected.
Case 3 does not update the row with the “older data”: the record in the DB was
from 10AM and the new record was from 8AM. This is great, and it is also the
behavior I want.
But in case 4, if I try to upsert older data that belongs to an OLDER
partition, the row is updated. This is weird; I would expect cases 3 and 4 to
behave the same.
Why does the partition of the data determine whether the data is updated or not?
Why did scenario 4 DELETE the data from partition 24 and insert it into 23? I
mean, it's great that Hudi only kept one copy of each key, but why the different
behaviour between scenarios 3 and 4?
This is all running in AWS Glue with Hudi 0.7.
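For reference, the write setup described above (global simple index, composite key, and the epoch timestamp used as both precombine and ordering field) might look roughly like the sketch below. The table name, the column names (`key1`, `key2`, `ts_epoch`, `partition_path`) and the helper `write_upsert` are hypothetical placeholders, not the contents of the attached file; only the option keys are actual Hudi settings.

```python
# Hypothetical column and table names; adjust to the real schema.
hudi_options = {
    "hoodie.table.name": "sample_table",
    "hoodie.datasource.write.operation": "upsert",
    # Composite record key built from two columns.
    "hoodie.datasource.write.recordkey.field": "key1,key2",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    # Partition column derived from the row timestamp (year/month/day).
    "hoodie.datasource.write.partitionpath.field": "partition_path",
    # The epoch timestamp is used for both ordering settings.
    "hoodie.datasource.write.precombine.field": "ts_epoch",
    "hoodie.payload.ordering.field": "ts_epoch",
    # Global index: key lookup spans all partitions.
    "hoodie.index.type": "GLOBAL_SIMPLE",
}

def write_upsert(df, base_path):
    """Upsert a Spark DataFrame into the Hudi table at base_path (e.g. an S3 URI)."""
    df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

`write_upsert` expects a Spark DataFrame whose columns match the names in `hudi_options`; it is only a sketch of the call shape and has not been run against Glue.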
CASE 1 - Inserting NEW data on the same partition
DB:
100 100 three 2021-06-23 10:00:00
Row In:
100 100 same partition 2021-06-23 10:01:00
Result (OK):
100 100 same partition 2021-06-23 10:01:00
CASE 2 - Inserting NEW data on different partition:
DB:
100 100 same partition 2021-06-23 10:01:00
Row In:
100 100 dif partition 2021-06-24 10:01:00
Result (OK):
100 100 dif partition 2021-06-24 10:01:00
CASE 3 - Inserting OLD data on the same partition
DB:
100 100 dif partition 2021-06-24 10:01:00
Row In:
100 100 old data same partition 2021-06-24 08:00:00
Result (OK):
100 100 dif partition 2021-06-24 10:01:00
CASE 4 - Inserting OLD data on different partition
DB:
100 100 dif partition 2021-06-24 10:01:00
Row In:
100 100 old data dif partition 2021-06-23 09:00:00
Result (BAD):
100 100 old data dif partition 2021-06-23 09:00:00
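The behaviour expected across the four cases can be written down as a small plain-Python model: one row per key across all partitions, replaced only when the incoming timestamp is newer. This is a sketch of the desired semantics, not of how Hudi works internally; `make_row` and `upsert` are hypothetical names, and treating equal timestamps as "keep the existing row" is an assumption.

```python
from datetime import datetime

KEY = (100, 100)

def make_row(description, ts):
    """Build a row; the partition is derived from the timestamp's date."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "key": KEY,
        "description": description,
        "ts": parsed,
        "partition": parsed.strftime("%Y/%m/%d"),
    }

def upsert(table, row):
    """Keep one row per key, regardless of partition; newer timestamp wins."""
    current = table.get(row["key"])
    if current is None or row["ts"] > current["ts"]:
        table[row["key"]] = row
    return table

# Initial DB state, then the "Row In" of cases 1 to 4 applied in order.
table = {}
upsert(table, make_row("three", "2021-06-23 10:00:00"))

incoming = [
    ("same partition", "2021-06-23 10:01:00"),           # case 1
    ("dif partition", "2021-06-24 10:01:00"),            # case 2
    ("old data same partition", "2021-06-24 08:00:00"),  # case 3
    ("old data dif partition", "2021-06-23 09:00:00"),   # case 4
]

results = []
for description, ts in incoming:
    upsert(table, make_row(description, ts))
    results.append(table[KEY]["description"])
```

Under this model, cases 3 and 4 both leave the "dif partition" row from 2021-06-24 10:01:00 in place, whereas the observed case 4 result replaced it with the older row.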
I am attaching the code that I am using.
Any help would be greatly appreciated.
Thanks very much
[hudisample.txt](https://github.com/apache/hudi/files/6924753/hudisample.txt)