deepakpanda93 commented on issue #17570: URL: https://github.com/apache/hudi/issues/17570#issuecomment-3646658673
Hello @bithw1,

In Hudi's design, two separate INSERTs are not guaranteed to produce two separate parquet files. Hudi's write path packs records efficiently into files up to the configured target file size, so under certain workloads the file assignment can land writes from different commits in the same file slice.

This behavior is controlled by `hoodie.parquet.small.file.limit`. From its config description: during insert and upsert operations, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files at an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, any file <= 100MB is treated as a small file. Also note that if this is set <= 0, Hudi will not look for small files and will directly write new files.

I tried with the above config, and here is the result. The first three inserts all go into the same file group (`e4bd341c-...`); after setting `hoodie.parquet.small.file.limit=-1`, the next insert creates a new file (`a80b6473-...`):

```
spark-sql (default)> insert into hudi_cow_20251212 select 1,1,1;
Time taken: 1.443 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-8-7_20251212141026106.parquet  1  1  1
Time taken: 0.114 seconds, Fetched 1 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,11,111;
Time taken: 1.341 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-21-17_20251212141046831.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-21-17_20251212141046831.parquet  1  11  111
Time taken: 0.098 seconds, Fetched 2 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,22,222;
Time taken: 0.861 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  11  111
20251212141108363  20251212141108363_0_2  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  22  222
Time taken: 0.119 seconds, Fetched 3 row(s)
spark-sql (default)> set hoodie.parquet.small.file.limit=-1;
hoodie.parquet.small.file.limit  -1
Time taken: 0.013 seconds, Fetched 1 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,33,333;
Time taken: 0.806 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  11  111
20251212141108363  20251212141108363_0_2  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  22  222
20251212141139030  20251212141139030_0_0  1  a80b6473-6eb1-4d3f-9c1f-dd344d7d21f6-0_0-46-36_20251212141139030.parquet  1  33  333
Time taken: 0.086 seconds, Fetched 4 row(s)
```
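For reference, here is a minimal config sketch of how one might set the relevant file-sizing options up front in a spark-sql session. The table name is the one from the transcript above; the numeric defaults shown are the ones documented for `hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size` (in bytes) — please verify them against your Hudi version:

```sql
-- Documented defaults, shown for reference:
-- files <= 100 MB are candidates for small-file expansion,
-- and Hudi targets parquet base files of up to ~120 MB.
SET hoodie.parquet.small.file.limit=104857600;  -- 100 MB, in bytes
SET hoodie.parquet.max.file.size=125829120;     -- 120 MB, in bytes

-- Disable small-file handling entirely (any value <= 0),
-- so every insert writes a new base file:
SET hoodie.parquet.small.file.limit=-1;
insert into hudi_cow_20251212 select 1,44,444;
```

With the limit set to `-1`, each commit produces a new file group, which matches the `a80b6473-...` file that appeared after the last insert in the transcript.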
