nsivabalan commented on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-904295065


   Ok, let me go over how COW works; it might help you understand the increase in data size.
   
   Let's say on the first ingest, these are the data files created within Hudi. For simplicity, let's assume just 1 partition.
   
   df1_v1: 1Gb
   df2_v1: 1Gb
   df3_v1: 1Gb
   
   3 data files are created, each at version 1 and 1 GB in size.
   
   Now, let's say we do an incremental load (inserts + updates), where the update records belong only to df1 and df3.
   With COW, Hudi creates a newer version of each data file it touches. So the final state would be:
   // old data file versions.
   df1_v1
   df2_v1
   df3_v1
   // newer version of existing data files
   df1_v2
   df3_v2
   // new data files for new inserts
   df4_v1
   
   Now total size = df1_v1 + df2_v1 + df3_v1 + df1_v2 + df3_v2 + df4_v1.
   So the total disk size occupied by Hudi largely depends on how updates are distributed. For instance, if a commit updates all data files (df1, df2, df3), every such commit doubles the size. But if you have thousands of data files, only some of them may get updated. And depending on whether small file handling is enabled, inserts may get bin-packed into existing data files or routed to new data files.
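To make the arithmetic above concrete, here is a toy model of COW file versioning (illustration only, not Hudi code): each write adds new versions of the files it touches, old versions stay on disk until cleaned, and total disk usage is the sum over all retained versions.

```python
# Toy model of COW file versioning (illustration only, not Hudi code).
# A "file" is a (file_id, version, size_gb) tuple; old versions remain
# on disk until the cleaner removes them.

def total_size_gb(files):
    """Sum sizes across every retained file version."""
    return sum(size for _, _, size in files)

# First ingest: three 1 GB base files, all at version 1.
files = [("df1", 1, 1.0), ("df2", 1, 1.0), ("df3", 1, 1.0)]

# Incremental load: updates touch df1 and df3 (new v2 copies are
# written), and the new inserts land in a fresh file df4.
files += [("df1", 2, 1.0), ("df3", 2, 1.0), ("df4", 1, 1.0)]

print(total_size_gb(files))  # 6.0 -> double the original 3 GB
```

Once the cleaner retires df1_v1 and df3_v1, usage drops back toward the size of the latest versions only.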
   
   Wrt the difference between insert_overwrite and bulk_insert with save mode Overwrite: with insert_overwrite, Hudi cleans up the invalidated files asynchronously. In other words, the files are not deleted synchronously at write time.
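As a hedged sketch of the two write paths being compared (the option key below is the standard Hudi Spark datasource config, but verify the exact spelling and defaults against your Hudi version; `df`, `path`, and `hudi_opts` are placeholders you'd configure yourself):

```python
# Sketch only, assuming a configured SparkSession with the Hudi bundle.
# `hudi_opts` would hold table name, record key, precombine field, etc.

# insert_overwrite: replaces the affected partitions; the superseded
# file versions are removed later by Hudi's cleaner, not at write time.
(df.write.format("hudi")
   .options(**hudi_opts)
   .option("hoodie.datasource.write.operation", "insert_overwrite")
   .mode("append")
   .save(path))

# bulk_insert with SaveMode.Overwrite: Spark's Overwrite mode replaces
# the existing table contents up front, before the bulk insert runs.
(df.write.format("hudi")
   .options(**hudi_opts)
   .option("hoodie.datasource.write.operation", "bulk_insert")
   .mode("overwrite")
   .save(path))
```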

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
