GitHub user suryaprasanna edited a discussion: Parquet Tool Interface for File-Level Operations in Clustering
### Context

I'd like to restart the discussion around adding a parquet tool interface for file-level operations during clustering. I previously opened PR #9006, which implements this capability, and I believe the feature would be valuable for the Hudi community.

### Problem Statement

Currently, Hudi's clustering strategies operate on a record-by-record basis. For certain use cases, such as column pruning, encryption, or selective column preservation, this approach is inefficient: these operations don't require reading and deserializing individual records, and they can be performed far more efficiently at the file level using parquet-tools.

### Proposed Solution

The PR introduces a ParquetToolsExecutionStrategy that enables efficient file-level operations during clustering. The implementation:

- Extends SingleSparkJobExecutionStrategy to provide a framework for file-level clustering operations
- Introduces HoodieFileWriteHandle for file-level operations (as opposed to record-level)
- Supports proper rollback via marker files
- Enables efficient rewriting without record iteration (see the first sketch below)

This interface would be particularly beneficial for:

1. Column pruning - removing unnecessary columns to reduce storage costs without deserializing records (see the second sketch below)
2. Encryption - applying encryption at the file level
3. Schema evolution - efficient column reordering or type changes

GitHub link: https://github.com/apache/hudi/discussions/17958
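To make the "rewriting without record iteration" idea concrete, here is a minimal sketch using parquet-mr's `ParquetFileReader`/`ParquetFileWriter` append API. This is not the code from PR #9006; the class and method names (`FileLevelRewriteSketch`, `concatFiles`) and the schema-handling shortcut are my own illustration of why footer-level copying is cheap.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

// Hypothetical sketch, not the PR's implementation: concatenate parquet
// files by copying whole row groups, without deserializing any records.
public class FileLevelRewriteSketch {

  public static void concatFiles(Configuration conf, List<Path> inputs, Path output)
      throws IOException {
    // Take the schema from the first input's footer; a real implementation
    // would first verify that every input shares an identical schema.
    MessageType schema;
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
      schema = reader.getFooter().getFileMetaData().getSchema();
    }

    ParquetFileWriter writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(output, conf),
        schema,
        ParquetFileWriter.Mode.CREATE,
        ParquetWriter.DEFAULT_BLOCK_SIZE,
        ParquetWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (Path input : inputs) {
      // appendFile copies the input's row groups into the output as raw
      // bytes; only footers are parsed, no record is materialized.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

This is the same mechanism parquet-tools uses for its merge command, and it illustrates why a file-level write handle can skip the record iterator path entirely.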

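For the column-pruning use case, a similar file-level shortcut exists. The sketch below assumes the `ColumnPruner` utility shipped in parquet-mr's `org.apache.parquet.hadoop.util` package (the class behind parquet-tools' prune command); the paths and column names are made up for illustration, and again this is not the PR's actual code.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.ColumnPruner;

// Hypothetical sketch: drop columns at the file level. Column chunks for
// the pruned columns are simply not copied to the output; the remaining
// chunks are written through as raw bytes, so no record is deserialized.
public class ColumnPruningSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Made-up paths; in a clustering run these would be the file group's
    // current base file and the new file version being produced.
    Path input = new Path("/tmp/hoodie/input.parquet");
    Path output = new Path("/tmp/hoodie/output.parquet");

    // Columns to remove; nested columns are addressed with dotted paths.
    new ColumnPruner().pruneColumns(conf, input, output,
        Arrays.asList("debug_payload", "metrics.raw_histogram"));
  }
}
```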