nsivabalan commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1050314304


   Hi @Rap70r,
   let me explain what happens behind the scenes with a COW (copy-on-write) 
table.
   
   Let's assume we have only one partition and that all the data we plan to 
ingest fits into one data file.
   
   Commit1: 
   writes data_file1_v1
   
   Commit2: updates the same set of records as commit1.
   Hudi writes data_file1_v2.
   // This is a new parquet file: Hudi merges the new incoming data with 
what is in data_file1_v1 and writes the result to data_file1_v2.
   If you query Hudi at this point, only data from data_file1_v2 is served.
   
   Commit3: again updates records from commit1.
   Hudi writes data_file1_v3.
   // Same logic as commit2 above.
   
   But Hudi has a cleaner that takes care of cleaning up older file 
versions. 
   For example, hoodie.cleaner.commits.retained is the config to play with. If 
you set it to 3, then at C4 data_file1_v1 will be cleaned up, and at C5 
data_file1_v2 will be cleaned up.
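
   The versioning and cleaning sequence above can be sketched with a small toy 
model. This is only an illustration of the example in this comment, not Hudi's 
actual cleaner logic; the function names and the simple cutoff rule 
(latest commit minus commits retained) are assumptions made for the sketch.

```python
# Toy model of copy-on-write file versions plus a commit-based cleaner.
# It mirrors the example in this comment; it is NOT Hudi's real cleaner code.

def clean(versions, latest_commit, commits_retained):
    """Drop versions written at or before (latest_commit - commits_retained).

    The newest version is always kept, because it is the one queries read.
    """
    cutoff = latest_commit - commits_retained
    newest = versions[-1]
    return [v for v in versions if v > cutoff or v == newest]


def run(num_commits, commits_retained):
    """Apply one update commit at a time and record which files remain."""
    versions = []
    history = {}
    for commit_no in range(1, num_commits + 1):
        # Each update commit rewrites the file group as a new version.
        versions.append(commit_no)
        versions = clean(versions, commit_no, commits_retained)
        history[commit_no] = [f"data_file1_v{v}" for v in versions]
    return history


history = run(num_commits=5, commits_retained=3)
for commit_no, files in history.items():
    print(f"C{commit_no}: {files}")
```

   In this sketch all three versions are still present after C3, v1 disappears 
at C4, and v2 disappears at C5, which matches the description above.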
   
   Now, let's take a look at how insert overwrite works. 
   Let's say you trigger insert overwrite table at C10. 
   Hudi will create a new file group, data_file2_v1, containing just the new 
incoming records, and mark all previous file groups as invalid. 
   So when Hudi is queried now, only data from data_file2_v1 will be served 
and nothing else.
   
   At some later point in time, when the cleaner kicks in, it will clean up all 
invalid file groups. 
   
   So the actual cleanup is lazy, and that's why you may see more files 
getting added with every commit.
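
   To make the lazy cleanup concrete, here is a minimal sketch of a table 
whose old file groups are only marked as replaced at write time and physically 
removed later. The ToyTable class and its methods are hypothetical names for 
this illustration; they are not part of any Hudi API.

```python
# Toy model of "insert overwrite table" with lazy cleanup.
# ToyTable and its methods are hypothetical; they only illustrate the idea.

class ToyTable:
    def __init__(self):
        # file group name -> {"records": [...], "replaced": bool}
        self.file_groups = {}

    def write(self, name, records):
        """Add a new, valid file group."""
        self.file_groups[name] = {"records": list(records), "replaced": False}

    def insert_overwrite_table(self, name, records):
        """Mark every existing file group as replaced, then add the new one."""
        for file_group in self.file_groups.values():
            file_group["replaced"] = True
        self.write(name, records)

    def query(self):
        """Serve records only from file groups that are still valid."""
        return [
            record
            for file_group in self.file_groups.values()
            if not file_group["replaced"]
            for record in file_group["records"]
        ]

    def files_on_storage(self):
        """Everything still physically present, valid or not."""
        return sorted(self.file_groups)

    def clean(self):
        """The cleaner physically removes replaced file groups."""
        self.file_groups = {
            name: fg for name, fg in self.file_groups.items()
            if not fg["replaced"]
        }


table = ToyTable()
table.write("data_file1_v3", ["a", "b"])
table.insert_overwrite_table("data_file2_v1", ["c"])

served = table.query()
before_clean = table.files_on_storage()
table.clean()
after_clean = table.files_on_storage()
```

   Right after the overwrite, queries see only the new records while both 
files are still on storage; the old file goes away only once clean() runs.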
   
   insert_overwrite_table: the entire table contents are replaced with the 
current batch. 
   insert_overwrite: overwrites only the matching partitions. Let's say your 
Hudi table has 1000 partitions and you are ingesting records into 100 of them 
with "insert_overwrite"; only those 100 partitions will be overwritten with 
new data. The remaining 900 will stay intact.
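
   The difference between the two operations can be sketched with plain 
dictionaries keyed by partition. The helper functions below are a toy model 
written for this comment, not Hudi code. The options dict at the end shows how 
the operation would typically be selected through the Spark datasource option 
hoodie.datasource.write.operation; the table name "my_table" is a placeholder.

```python
# Toy model: a table is a dict of partition name -> list of records.
# These helpers only illustrate the semantics; they are not Hudi APIs.

def insert_overwrite(table, batch):
    """Replace only the partitions that appear in the incoming batch."""
    result = dict(table)
    result.update(batch)
    return result


def insert_overwrite_table(table, batch):
    """Replace the whole table with the incoming batch."""
    return dict(batch)


# 1000 existing partitions, incoming batch touches 100 of them.
table = {f"p{i}": ["old"] for i in range(1000)}
batch = {f"p{i}": ["new"] for i in range(100)}

after_partition_overwrite = insert_overwrite(table, batch)
after_table_overwrite = insert_overwrite_table(table, batch)

untouched = sum(1 for recs in after_partition_overwrite.values() if recs == ["old"])
print(len(after_partition_overwrite), untouched, len(after_table_overwrite))

# Write options one might pass to the Hudi Spark datasource writer.
# "my_table" is a placeholder; other required options are omitted here.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.operation": "insert_overwrite",  # or "insert_overwrite_table"
}
```

   With insert_overwrite the table still has 1000 partitions and 900 of them 
keep their old data; with insert_overwrite_table only the 100 partitions from 
the batch remain.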
   
   Hope this clarifies things.
   
   
   
   
   
   

