deepakpanda93 commented on issue #17570: URL: https://github.com/apache/hudi/issues/17570#issuecomment-3645431074

Hello @bithw1,

Both records land in the same Parquet file because Hudi's write path is explicitly designed to reuse existing small files for new inserts, rather than creating a new file per commit or per record. For a CoW table with `operation=insert`:

1. **Index lookup & file sizing:** Hudi checks the existing file groups in the target partition and their sizes.
2. **Small file handling:** If there is an existing small file in the target partition (your first insert's file slice), Hudi routes new inserts into that file group to "top it up" until it approaches the max file size, instead of creating a new file group/file slice.

In your example:

- Two separate commits both contain inserts for the same partition and primary key space.
- Hudi's small-file logic reuses the existing small file group, so both commits write to the same Parquet file ID (`966b7c8e-...`), with different commit instants recorded in the metadata columns.

This behavior is therefore expected and aligned with Hudi's design for CoW tables and the `insert` operation. If you want each commit to go to a different file group, you can disable small-file handling by setting `hoodie.parquet.small.file.limit = -1`.
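For reference, here is a minimal PySpark-style sketch of passing that setting as a write option. The table name, path, and the `spark`/`df` variables are illustrative placeholders, not from the original report:

```python
# Sketch of Hudi write options that disable small-file handling.
# Table name and path below are hypothetical placeholders.
hudi_options = {
    "hoodie.table.name": "hudi_trips",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # -1 disables the "top up existing small files" behavior, so each
    # insert commit is written to a new file group instead of being
    # routed into an existing small file.
    "hoodie.parquet.small.file.limit": "-1",
}

# With a SparkSession `spark` and a DataFrame `df`, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi_trips")
```

Note that disabling small-file handling trades read performance for write isolation: you will accumulate many small Parquet files unless you compact or cluster them later.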
