mkk1490 edited a comment on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-903967930
Answers to your questions:
1. Incremental loads are upserted records: each batch contains new records to
insert as well as updates to existing data. The 400 GB dataset holds nearly 10
billion records. Each incremental batch brings roughly 750 million new inserts
and around 600 million updates.
2. Hudi data size is the total object size on S3. That includes all retained
versions of the data (6 commits).
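The 6-commit retention mentioned above is what keeps multiple file versions on S3. As a hedged sketch (the config keys are standard Hudi writer options; the values shown are assumptions matching the setup described here, not taken from the actual job):

```python
# Hudi writer options controlling how many commit versions the cleaner
# retains. With 6 commits retained, up to 6 versions of each updated
# file group can sit on S3 until the cleaner removes the older ones.
hudi_cleaner_options = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "6",
}

def retained_versions(options):
    """Number of commit versions the cleaner keeps per file group."""
    return int(options["hoodie.cleaner.commits.retained"])

print(retained_versions(hudi_cleaner_options))  # 6
```

Lowering `hoodie.cleaner.commits.retained` trades incremental-query/rollback history for less storage, since fewer old file versions survive each clean.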
3. "Original" is the data size of the external table currently in use; I'm
trying to replicate the existing scenario on a Hudi table.
Existing data size on S3 after IDL: 1.6 TB
Same data on a Hudi table after IDL: 1.3 TB
Existing data size on S3 after upsert: the existing process copies the entire
dataset plus the new data into another partition, so the total size on disk
becomes the IDL data size + (IDL + 1). This is full data replication, and it
comes to a total of 2.9 TB.
With Hudi I don't need to replicate the data; I just upsert the incoming
records from the new batch. This comes to a total of 2.5 TB on S3. I expected
the 1.3 TB IDL size plus roughly 500 GB of new data, i.e. under 2 TB in Hudi,
but it is almost double the IDL size. In terms of record counts, the IDL had
30 billion records and the new batch has a little over 1 billion. 30 billion
records take 1.3 TB on disk; 31 billion records take 2.3 TB. Commits
retained: 6
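To make the size gap concrete, the arithmetic above can be sketched as follows (all figures are the approximate TB values quoted in this comment; the attribution of the gap to retained commit versions is my working assumption, not a confirmed diagnosis):

```python
# Size arithmetic from the comment, for illustration only.
idl_size_tb = 1.3        # Hudi table size after the initial data load (IDL)
new_batch_size_tb = 0.5  # ~500 GB of new data in the incremental batch

# Expected: IDL size plus new data, with no full replication.
expected_tb = idl_size_tb + new_batch_size_tb  # well under 2 TB

# Observed on S3 after the upsert.
observed_tb = 2.5

# Working assumption: the gap is space held by retained commit versions,
# since updated file groups keep older copies until the cleaner runs.
extra_tb = round(observed_tb - expected_tb, 1)
print(round(expected_tb, 1), extra_tb)
```

If that assumption holds, the extra space should shrink as the cleaner removes older file versions beyond the retained-commit window.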
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]