deepakpanda93 commented on issue #17570: URL: https://github.com/apache/hudi/issues/17570#issuecomment-3646658673
Hello @bithw1,

In Hudi's design, two separate INSERTs are not guaranteed to produce two separate parquet files. Hudi's write path packs records efficiently into files up to the configured target file size, so under certain workloads the file assignment can land writes from different commits in the same file slice.

This behavior is controlled by `hoodie.parquet.small.file.limit`. From its config description: during insert and upsert operations, Hudi opportunistically expands existing small files on storage, instead of writing new files, to keep the number of files at an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, any file <= 100MB is treated as a small file. Also note that if this is set <= 0, Hudi will not look for small files and will directly write new files.

I tried with the above config, and here is the result. The first three inserts all go into the same file group (`e4bd341c-...`); after setting `hoodie.parquet.small.file.limit=-1`, the next insert creates a new file (`a80b6473-...`):

```
spark-sql (default)> insert into hudi_cow_20251212 select 1,1,1;
Time taken: 1.443 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-8-7_20251212141026106.parquet  1  1  1
Time taken: 0.114 seconds, Fetched 1 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,11,111;
Time taken: 1.341 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-21-17_20251212141046831.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-21-17_20251212141046831.parquet  1  11  111
Time taken: 0.098 seconds, Fetched 2 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,22,222;
Time taken: 0.861 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  11  111
20251212141108363  20251212141108363_0_2  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  22  222
Time taken: 0.119 seconds, Fetched 3 row(s)
spark-sql (default)> set hoodie.parquet.small.file.limit=-1;
hoodie.parquet.small.file.limit  -1
Time taken: 0.013 seconds, Fetched 1 row(s)
spark-sql (default)> insert into hudi_cow_20251212 select 1,33,333;
Time taken: 0.806 seconds
spark-sql (default)> select * from hudi_cow_20251212;
20251212141026106  20251212141026106_0_0  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  1  1
20251212141046831  20251212141046831_0_1  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  11  111
20251212141108363  20251212141108363_0_2  1  e4bd341c-ce68-42d7-97c5-be4b8986e773-0_0-34-27_20251212141108363.parquet  1  22  222
20251212141139030  20251212141139030_0_0  1  a80b6473-6eb1-4d3f-9c1f-dd344d7d21f6-0_0-46-36_20251212141139030.parquet  1  33  333
Time taken: 0.086 seconds, Fetched 4 row(s)
```
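For reference, here is a minimal config sketch of how one might set the relevant file-sizing options up front in a spark-sql session. The table name is the one from the transcript above; the numeric defaults shown are the ones documented for `hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size` (in bytes) — please verify them against your Hudi version:

```sql
-- Documented defaults, shown for reference:
-- files <= 100 MB are candidates for small-file expansion,
-- and Hudi targets parquet base files of up to ~120 MB.
SET hoodie.parquet.small.file.limit=104857600;  -- 100 MB, in bytes
SET hoodie.parquet.max.file.size=125829120;     -- 120 MB, in bytes

-- Disable small-file handling entirely (any value <= 0),
-- so every insert writes a new base file:
SET hoodie.parquet.small.file.limit=-1;
insert into hudi_cow_20251212 select 1,44,444;
```

With the limit set to `-1`, each commit produces a new file group, which matches the `a80b6473-...` file that appeared after the last insert in the transcript.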
