zhangyue19921010 commented on PR #13365: URL: https://github.com/apache/hudi/pull/13365#issuecomment-3018901372
> Sorry for the late comments. The PR has been merged. Maybe consider these points for future improvements:
>
> Handling schema evolution and masking columns involves writing a lot of low-level Parquet code and introduces some complexity. If we can group files by schema, then we can merge only files with the same schema, which would help avoid this complexity.
>
> There was a file merging functionality in [parquet-java](https://github.com/apache/parquet-java/blob/parquet-1.11.x/parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java) that we could potentially reuse. The code has been tested and used in production. This command was later removed when the entire parquet-tools was deprecated, but we could consider bringing it back. At least most of the core implementations, such as appendFile() and mergeMetadataFiles(), still exist.

Thanks @shangxinli, grouping by schema is a great idea; it would save a lot of data compatibility verification work. I will take a deeper look and propose a new clustering plan based on it as soon as possible.
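To make the group-by-schema idea concrete, here is a minimal sketch of the bucketing step only. It is not Hudi or parquet-java code: `group_files_by_schema` and the `(column name, type name)` schema representation are hypothetical stand-ins, and a real clustering plan would read the schema from each Parquet footer instead of taking it as input.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for a Parquet schema: ordered (column name, type name) pairs.
Column = Tuple[str, str]


def group_files_by_schema(file_schemas: Dict[str, Sequence[Column]]) -> List[List[str]]:
    """Bucket file paths so each bucket holds only files with an identical schema.

    Files in the same bucket could then be merged without any schema
    evolution or column masking handling.
    """
    groups: Dict[Tuple[Column, ...], List[str]] = defaultdict(list)
    for path, schema in file_schemas.items():
        # Normalize to a hashable key; column order is treated as significant.
        key = tuple((name, type_name) for name, type_name in schema)
        groups[key].append(path)
    # Sort paths within each bucket, then buckets by first path, for stable output.
    return sorted((sorted(paths) for paths in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    files = {
        "f1.parquet": [("id", "int64"), ("name", "string")],
        "f2.parquet": [("id", "int64"), ("name", "string")],
        "f3.parquet": [("id", "int64"), ("name", "string"), ("age", "int32")],
    }
    for bucket in group_files_by_schema(files):
        print(bucket)
```

Each resulting bucket is a candidate merge group; a bucket with a single file has nothing to merge with and could simply be skipped by the planner.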
