nsivabalan commented on issue #4876: URL: https://github.com/apache/hudi/issues/4876#issuecomment-1050314304
Hi @Rap70r, let me explain what happens behind the scenes with a COW table. Let's assume we have only one partition and that all the data we plan to ingest fits into one data file.

Commit1: writes data_file1_v1.
Commit2: updates the same set of records as Commit1. Hudi writes data_file1_v2, a new parquet file that merges the new incoming data with what's in data_file1_v1. When you query Hudi at this point, only data from data_file1_v2 is served.
Commit3: again updates records from Commit1. Hudi writes data_file1_v3, using the same logic as Commit2.

Left alone, these versions would pile up, but Hudi has a cleaner that takes care of removing older file versions. For example, hoodie.cleaner.commits.retained is the config to play with. If you set it to 3, data_file1_v1 will be cleaned up at C4 and data_file1_v2 at C5.

Now let's look at how insert overwrite works. Say you trigger insert overwrite table at C10. Hudi creates a new file group, data_file2_v1, with just the new incoming records, and marks all previous file groups as invalid. So when Hudi is queried now, only data from data_file2_v1 is served and nothing else. At some later point, when the cleaner kicks in, it cleans up all the invalid file groups. So the actual cleanup is lazy, and that's why you may see more files getting added with every commit.

insert_overwrite_table: the entire table contents are replaced with the current batch.
insert_overwrite: overwrites only the matching partitions. Say your Hudi table has 1000 partitions and you ingest records for 100 of them with "insert_overwrite": only those 100 partitions are overwritten with the new data, and the remaining 900 stay intact.

Hope this clarifies things.
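
To make the versioning and the lazy cleanup concrete, here is a toy Python model of the behaviour described above. It is only a sketch and not Hudi code: ToyCowTable, FileGroup, the partition names and the versions_retained parameter are all made up for illustration, and the retention rule is simplified to "keep the newest N versions of each file group", which is cruder than what the real cleaner does with hoodie.cleaner.commits.retained.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    """Toy stand-in for a file group: one file id with several versions."""
    name: str
    versions: list = field(default_factory=list)  # version numbers still on storage
    written: int = 0        # how many versions were ever written
    replaced: bool = False  # True once an insert overwrite invalidates the group

    def write_version(self):
        self.written += 1
        self.versions.append(self.written)


class ToyCowTable:
    """Hypothetical model of a COW table; none of this is Hudi's real API."""

    def __init__(self, versions_retained):
        self.versions_retained = versions_retained  # assumed to be >= 1
        self.partitions = {}  # partition path -> list of FileGroup
        self.groups_created = 0

    def _new_group(self, partition):
        self.groups_created += 1
        group = FileGroup(f"{partition}/data_file{self.groups_created}")
        group.write_version()
        self.partitions.setdefault(partition, []).append(group)

    def upsert(self, partition):
        """Merge incoming records into the live file group as a new version."""
        live = [g for g in self.partitions.get(partition, []) if not g.replaced]
        if live:
            live[-1].write_version()
        else:
            self._new_group(partition)
        self.clean()  # in this toy model the cleaner runs after every upsert

    def insert_overwrite(self, incoming_partitions):
        """Replace only the partitions present in the incoming batch."""
        for partition in incoming_partitions:
            for group in self.partitions.get(partition, []):
                group.replaced = True
            self._new_group(partition)
        # No clean() here: replaced groups stay on storage until the cleaner runs.

    def insert_overwrite_table(self, incoming_partitions):
        """Replace the whole table with the incoming batch."""
        for groups in self.partitions.values():
            for group in groups:
                group.replaced = True
        for partition in incoming_partitions:
            self._new_group(partition)

    def clean(self):
        """Drop replaced file groups and all but the newest N versions."""
        for partition in list(self.partitions):
            kept = [g for g in self.partitions[partition] if not g.replaced]
            for group in kept:
                del group.versions[:-self.versions_retained]
            self.partitions[partition] = kept

    def files_on_storage(self):
        return [f"{g.name}_v{v}" for groups in self.partitions.values()
                for g in groups for v in g.versions]

    def queryable_files(self):
        return [f"{g.name}_v{g.versions[-1]}" for groups in self.partitions.values()
                for g in groups if not g.replaced]


if __name__ == "__main__":
    table = ToyCowTable(versions_retained=3)
    for _ in range(5):              # C1..C5 all update the same records
        table.upsert("p0")
    print(table.files_on_storage()) # v1 and v2 are already gone
    table.insert_overwrite_table(["p0"])
    print(table.queryable_files())  # only the new file group is served
    print(table.files_on_storage()) # old versions are still there until clean()
    table.clean()
    print(table.files_on_storage())
```

Right after insert_overwrite_table the old data_file1 versions are still listed by files_on_storage() even though queryable_files() no longer returns them, which is the "more files with every commit" effect. They only disappear once clean() runs.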
