nsivabalan commented on issue #4929: URL: https://github.com/apache/hudi/issues/4929#issuecomment-1057471246
In general, small file handling in Hudi works as follows. Suppose commit C1 writes files of 10MB (file group 1), 20MB (file group 2), and 100MB (file group 3). If commit C2 then brings inserts, Hudi bin-packs them into file group 1 and file group 2 so that each can grow up to 100MB (assuming you have set the max parquet file size to 100MB). Within the same batch, at least for the first commit, Hudi has no information about the record size, so it assumes an average record size of 1KB when creating the splits.
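To make the arithmetic concrete, here is a rough Python sketch of the bin-packing idea described above. It is not Hudi's actual implementation: the function name `pack_inserts`, its parameters, and the order in which file groups are filled are all assumptions made for illustration. It only models the two points from the comment, namely that inserts go to file groups below the max file size first, and that the number of records a file can take is estimated from an assumed average record size of 1KB.

```python
MB = 1024 * 1024


def pack_inserts(existing_sizes, num_inserts,
                 max_file_size=100 * MB, avg_record_size=1024):
    """Hypothetical sketch: spread `num_inserts` records over small file groups.

    existing_sizes: dict of file group name -> current size in bytes.
    Returns (per_group, new_groups), where per_group maps an existing file
    group to the number of records assigned to it, and new_groups lists the
    record counts of newly created file groups.
    """
    per_group = {}
    remaining = num_inserts

    # Fill existing file groups that are still below the max file size.
    # Iteration order is simply the dict order here (an assumption).
    for name, size in existing_sizes.items():
        if remaining == 0:
            break
        # Room left in the file, converted to a record count using the
        # assumed average record size.
        capacity = max(0, max_file_size - size) // avg_record_size
        take = min(capacity, remaining)
        if take > 0:
            per_group[name] = take
            remaining -= take

    # Whatever is left goes into new file groups of up to max_file_size each.
    records_per_new_file = max(1, max_file_size // avg_record_size)
    new_groups = []
    while remaining > 0:
        take = min(records_per_new_file, remaining)
        new_groups.append(take)
        remaining -= take

    return per_group, new_groups


# File sizes after commit C1, as in the example above.
c1_sizes = {
    "file_group_1": 10 * MB,
    "file_group_2": 20 * MB,
    "file_group_3": 100 * MB,
}

# Commit C2 brings 200,000 inserts.
print(pack_inserts(c1_sizes, 200_000))
# ({'file_group_1': 92160, 'file_group_2': 81920}, [25920])
```

With a 1KB estimate, file group 1 has room for 90MB / 1KB = 92,160 records and file group 2 for 80MB / 1KB = 81,920. File group 3 is already at 100MB and receives nothing, so the remaining 25,920 records spill into a new file group. If the real records are much larger or smaller than 1KB, the resulting files will overshoot or undershoot the target accordingly, which is why the estimate matters for the first commit.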
