nsivabalan commented on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-903878681


   @mkk1490 : sorry for the late turnaround; I've been busy with work. 
   A few clarifying questions:
   1. What do you mean by "incremental loads"? 
   ```
   Original data size = 414 GB and Hudi data size = 424 GB
   After 4 incremental loads, original came around 2100 GB with snapshot data in each partition, and Hudi data size was 547 GB
   I tested for another scenario with >1.5 TB with the same process as above
   ```
   In this context, does an incremental load mean that you took your original dataset, made updates to some of the records, and did an upsert to Hudi? Or did you ingest into Hudi without making any updates? 
   Also, is 2100 GB = (414 GB * 5), i.e. one version of the original plus 4 incremental copies? 
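   To make the distinction concrete, here is a minimal sketch (plain Python with hypothetical record keys, not actual Hudi APIs) contrasting the two interpretations: an upsert into a keyed table like Hudi's, where re-sent keys overwrite records in place, versus a plain append, where every load adds a full copy:

   ```python
   def upsert(table, batch):
       """Update-or-insert by record key: existing keys are overwritten in place."""
       for key, value in batch:
           table[key] = value
       return table

   def append(rows, batch):
       """Key-less ingestion: every batch simply adds new rows."""
       rows.extend(batch)
       return rows

   # Start with 4 records; each "incremental load" re-sends the same keys with new values.
   snapshot = {k: 0 for k in range(4)}
   raw_copies = [(k, 0) for k in range(4)]

   for version in range(1, 5):          # 4 incremental loads
       batch = [(k, version) for k in range(4)]
       upsert(snapshot, batch)          # keyed table: record count stays flat
       append(raw_copies, batch)        # raw copies: record count grows every load

   print(len(snapshot))     # 4  -> snapshot unchanged under updates
   print(len(raw_copies))   # 20 -> original + 4 full copies
   ```

   If the loads were pure updates, this would explain why the raw data grew roughly 5x (414 GB toward ~2100 GB) while the Hudi snapshot grew far less (424 GB to 547 GB).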
   
   2. Also, by Hudi data size, do you mean the on-disk size (which includes all versions of the data), or just the snapshot data size? 
   
   3. 
   ```
   Original = 1.6 TB, Hudi 1.3 TB
   After 1 incremental, orig = 2.9 TB and Hudi 2.5 TB
   ```
   This confuses me as to what "incremental" means here. In the previous example, I thought you took the entire original dataset and made another copy with some updates, but here I see only 1.3 TB added to 1.6 TB. Or, in general, does incremental mean updates to some records from the original snapshot? 
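   For reference, the arithmetic behind these two observations as I read them (the figures are copied from your report; the interpretation is mine):

   ```python
   original_gb = 414                 # initial snapshot size from scenario 1
   print(original_gb * 5)            # 2070 -- one original + 4 full copies, close to the reported ~2100 GB

   orig_tb, added_tb = 1.6, 1.3      # scenario 3 figures
   print(orig_tb + added_tb)         # roughly 2.9 TB: only one extra copy's worth of data, not a 5x pattern
   ```

   These two growth patterns are inconsistent with each other, which is why a precise definition of "incremental load" matters here.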
   
   Once I get clarification on these, I will respond to your questions. 
   @codope : will look into the delete issue with bulk_insert. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

