mkk1490 edited a comment on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-903967930


   Answers to your questions:
   1. Incremental loads are upserted records: each new batch contains records to insert as well as updates to the existing data. The 400 GB dataset holds nearly 10 billion records, and each incremental batch brings around 750 mn new inserts and around 600 mn updates.
   2. Hudi data size is the total object size on S3, which includes all retained versions of the data (6 commits).
   3. Original is the data size of the external table currently in use; I'm trying to replicate the existing scenario on a Hudi table.
   Existing data size on S3 after IDL (initial data load): 1.6 TB
   Same data size in the Hudi table after IDL: 1.3 TB
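   For reference, the version retention mentioned in answer 2 is governed by Hudi's cleaner settings. A minimal PySpark-style options sketch (config keys are from the Hudi docs; the table name and save path are hypothetical placeholders):

   ```python
   # Hudi write options controlling the upsert operation and how many
   # commits' worth of file versions the cleaner retains on S3.
   # Table name and path below are hypothetical placeholders.
   hudi_options = {
       "hoodie.table.name": "my_table",                # hypothetical
       "hoodie.datasource.write.operation": "upsert",  # upsert the incoming batch
       "hoodie.cleaner.commits.retained": "6",         # keep file versions for 6 commits
   }

   # Typical usage with a Spark DataFrame `df`:
   # df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
   ```

   Lowering `hoodie.cleaner.commits.retained` reduces how many old file versions linger on S3, at the cost of a shorter window for incremental queries and rollbacks.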
   
   Existing data size on S3 after upsert: the existing process copies the entire dataset plus the new data into another partition, so the total size on disk becomes the IDL data size plus (IDL + 1). This is data replication, and it comes to a total of 2.9 TB.
    
    For Hudi, I don't need to replicate the data; I just upsert the incoming records from the new batch. This comes to a total of 2.5 TB on S3. I expected the 1.3 TB of IDL data plus roughly 500 GB of new data, i.e. under 2 TB in Hudi, but it is almost double the IDL size. In terms of record counts, the IDL had 30 billion records and the new batch has a little over 1 bn. For 30 bn records the size on disk is 1.3 TB; for 31 bn records it is 2.3 TB. Commits retained: 6.
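   One possible explanation for the near-doubling, assuming each upsert rewrites the file groups it touches and the cleaner keeps file versions for 6 commits: the old versions of every rewritten file slice stay on S3 alongside the new ones until cleaning reclaims them. A rough back-of-envelope sketch (all numbers illustrative, taken from the figures above; the rewrite fraction is a hypothetical assumption, not a measured value):

   ```python
   # Rough storage estimate when updates rewrite file groups and Hudi's
   # cleaner retains earlier commits' file versions on S3.
   # All figures are illustrative; rewrite_fraction is an assumption.

   base_tb = 1.3           # IDL dataset size in the Hudi table
   new_data_tb = 0.5       # net new data from the incremental batch
   rewrite_fraction = 0.55 # hypothetical share of file groups touched by updates

   # After one upsert commit: untouched files + new data + the old versions
   # of rewritten file groups, which linger until the cleaner runs.
   after_upsert_tb = base_tb + new_data_tb + rewrite_fraction * base_tb
   print(f"{after_upsert_tb:.2f} TB")
   ```

   Under this assumption, a rewrite fraction of roughly half the file groups per commit would already account for the observed ~2.5 TB, so checking how widely the updates are scattered across file groups (and whether the cleaner is actually running) may be the first thing to verify.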
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

