deepakpanda93 commented on issue #17570: URL: https://github.com/apache/hudi/issues/17570#issuecomment-3645431074

Hello @bithw1,

Both records land in the same Parquet file because Hudi's write path is explicitly designed to reuse existing small files for new inserts, rather than creating a new file per commit or per record. For a CoW table with `operation=insert`:

1. **Index lookup & file sizing:** Hudi checks the existing file groups in the target partition and their sizes.
2. **Small file handling:** If there is an existing small file in the target partition (your first insert's file slice), Hudi routes new inserts into that file group to "top it up" until it approaches the max file size, instead of creating a new file group/file slice.

In your example:

- Two separate commits both contain inserts for the same partition and primary key space.
- Hudi's small-file logic reuses the existing small file group, so both commits write to the same Parquet file ID (`966b7c8e-...`), with different commit instants recorded in the metadata columns.

This behavior is therefore expected and aligned with Hudi's design for CoW tables and the `insert` operation. If you want each commit to go to a different file group, you can disable small-file handling by setting `hoodie.parquet.small.file.limit = -1`.
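For reference, here is a minimal PySpark-style sketch of passing that setting as a write option. The table name, path, and the `spark`/`df` variables are illustrative placeholders, not from the original report:

```python
# Sketch of Hudi write options that disable small-file handling.
# Table name and path below are hypothetical placeholders.
hudi_options = {
    "hoodie.table.name": "hudi_trips",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # -1 disables the "top up existing small files" behavior, so each
    # insert commit is written to a new file group instead of being
    # routed into an existing small file.
    "hoodie.parquet.small.file.limit": "-1",
}

# With a SparkSession `spark` and a DataFrame `df`, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi_trips")
```

Note that disabling small-file handling trades read performance for write isolation: you will accumulate many small Parquet files unless you compact or cluster them later.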
