nsivabalan edited a comment on issue #4929:
URL: https://github.com/apache/hudi/issues/4929#issuecomment-1057471246


   In general, Hudi's small file handling logic works as follows.
   Suppose in commit C1 you end up writing files of size 10MB (file group 1),
20MB (file group 2), and 100MB (file group 3).
   In commit C2, if you get inserts, Hudi bin-packs them into file group 1 and
file group 2 so that each can grow up to 100MB (assuming you have set the max
parquet file size to 100MB).
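
   The bin-packing idea above can be sketched roughly as follows. This is an illustrative Python sketch, not Hudi's actual code; the constants stand in for `hoodie.parquet.max.file.size`, `hoodie.parquet.small.file.limit`, and the default 1KB record-size estimate, and `plan_inserts` is a made-up helper name:

```python
# Rough sketch (NOT Hudi's real implementation) of small-file bin-packing:
# incoming inserts first top up existing file groups that are below the
# small-file limit, growing each toward the max parquet file size; any
# leftover bytes spill into brand-new file groups.

MAX_FILE_SIZE = 100 * 1024 * 1024      # analogous to hoodie.parquet.max.file.size
SMALL_FILE_LIMIT = 100 * 1024 * 1024   # files below this count as "small"
AVG_RECORD_SIZE = 1024                 # 1KB default estimate on the first commit

def plan_inserts(existing_file_sizes, num_records):
    """Return (bytes assigned to each existing file group, number of new groups)."""
    remaining = num_records * AVG_RECORD_SIZE
    assignments = []
    for size in existing_file_sizes:
        if size < SMALL_FILE_LIMIT and remaining > 0:
            take = min(MAX_FILE_SIZE - size, remaining)  # top up toward max size
            assignments.append(take)
            remaining -= take
        else:
            assignments.append(0)
    # leftovers become new file groups, each up to MAX_FILE_SIZE
    new_groups = -(-remaining // MAX_FILE_SIZE) if remaining else 0
    return assignments, new_groups
```

   With the sizes from the example (10MB, 20MB, 100MB) and 200MB worth of 1KB records, the first two groups absorb 90MB and 80MB respectively, and the remaining 30MB lands in one new file group.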
   
   Within the same batch, at least for the first commit, Hudi has no idea of
the actual record size, so it assumes an average record size of 1KB and creates
the splits.
   That may be why you are seeing a lot of 6.5MB files. Another possible reason
is that your write parallelism is too high, so every Spark task gets only about
6.5MB worth of data.
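
   Both knobs can be tuned at write time. The option keys below are real Hudi write configs; the table name, parallelism values, and record-size estimate are placeholder values you would adapt to your own workload:

```python
# Hedged example: Hudi write options that influence file sizing.
# "my_table" and the numeric values are placeholders, not recommendations.
hudi_options = {
    "hoodie.table.name": "my_table",
    # target/max size of a parquet file
    "hoodie.parquet.max.file.size": str(100 * 1024 * 1024),
    # files smaller than this are candidates for bin-packing
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # override the 1KB default if your records are a different size,
    # so first-commit splits come out closer to the target file size
    "hoodie.copyonwrite.record.size.estimate": "512",
    # lower parallelism so each Spark task writes more than ~6.5MB
    "hoodie.insert.shuffle.parallelism": "8",
    "hoodie.upsert.shuffle.parallelism": "8",
}

# Typical usage with a Spark DataFrame `df` (path is a placeholder):
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/my_table")
```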

