Re: [PR] [HUDI-9468] Parquet Binary Copy at Rowgroup Level [hudi]

via GitHub Mon, 30 Jun 2025 05:06:45 -0700


zhangyue19921010 commented on PR #13365:
URL: https://github.com/apache/hudi/pull/13365#issuecomment-3018901372


   > Sorry for the late comments. The PR has been merged. Maybe consider these points for future improvements:
   > 
   > Handling schema evolution and masking columns involves writing a lot of low-level Parquet code and introduces some complexity. If we can group files by schema, then we can merge only files with the same schema, which would help avoid this complexity.
   > 
   > There was a file merging functionality in [parquet-java](https://github.com/apache/parquet-java/blob/parquet-1.11.x/parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java) that we could potentially reuse. The code has been tested and used in production. This command was later removed when the entire parquet-tools was deprecated, but we could consider bringing it back. At least most of the core implementations, such as appendFile() and mergeMetadataFiles(), still exist.
   
   Thanks @shangxinli, grouping by schema is a great idea; it would save a lot of 
data compatibility verification work. I will take a deeper look and propose a new 
clustering plan based on that ASAP.
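
   As a rough illustration of the group-by-schema idea (a minimal sketch, not code from this PR): files are bucketed by their schema so that each bucket could be merged on its own without any cross-schema compatibility checks. The class `SchemaGrouping`, the method `groupBySchema`, and the use of a plain schema string as the grouping key are all hypothetical; a real clustering plan would presumably read the schema from each Parquet footer and might need a normalized comparison rather than exact string equality.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bucket files by schema so that only files sharing
// the same schema end up in the same merge group.
public class SchemaGrouping {

    // schemaByPath maps a file path to its schema string. Both are stand-ins
    // for whatever the clustering planner would actually read from file footers.
    public static Map<String, List<String>> groupBySchema(Map<String, String> schemaByPath) {
        // LinkedHashMap keeps groups in first-seen order, which makes the plan deterministic.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : schemaByPath.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("f1.parquet", "message r { required int64 id; }");
        files.put("f2.parquet", "message r { required int64 id; optional binary name; }");
        files.put("f3.parquet", "message r { required int64 id; }");

        // Each printed line is one candidate merge group.
        for (Map.Entry<String, List<String>> g : groupBySchema(files).entrySet()) {
            System.out.println(g.getValue() + " share schema: " + g.getKey());
        }
    }
}
```

   With this grouping, f1.parquet and f3.parquet land in one group and f2.parquet in another, so a binary copy would only ever combine row groups written with identical schemas.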


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
