nsivabalan opened a new pull request, #18379:
URL: https://github.com/apache/hudi/pull/18379

   ### Describe the issue this Pull Request addresses
   
   This PR adds support for custom Parquet configuration injection across all 
file writer factories in Apache Hudi. This feature allows users to inject 
custom Parquet configurations (e.g., native Parquet bloom filters, custom 
compression settings, dictionary encoding overrides) at runtime without 
modifying Hudi's core code.
   
   Motivation: Users sometimes need to apply specific Parquet configurations 
for certain tables or partitions (e.g., disable dictionary encoding for 
high-cardinality columns, enable native Parquet bloom filters for specific 
columns, or apply custom encoding strategies). Previously, these configurations 
were hard-coded or required code changes. This PR introduces a pluggable 
mechanism via the HoodieParquetConfigInjector interface.
   
   ### Summary and Changelog
   
   Summary: Added support for custom Parquet configuration injection across the Spark, Avro, and Flink file writers. Users can now implement the HoodieParquetConfigInjector interface and specify it via the hoodie.parquet.config.injector.class configuration to inject custom Parquet settings at write time.
   
   Changes:

   1. Core Implementation (hudi-client-common):
      - Added the HoodieParquetConfigInjector interface with a withProps() method that accepts a StoragePath, a StorageConfiguration, and a HoodieConfig, and returns the modified configurations
   2. Spark Integration (hudi-spark-client):
      - Modified HoodieSparkFileWriterFactory.newParquetFileWriter() to check for and invoke the config injector (lines 66-79)
      - Added comprehensive tests in TestHoodieParquetConfigInjector:
        - testDisableDictionaryEncodingViaInjector() - validates that dictionary encoding can be disabled
        - testInvalidInjectorClassThrowsException() - validates error handling
        - testNoInjectorUsesDefaultConfig() - validates backward compatibility
      - Tests validate the actual Parquet metadata (encodings) rather than just the configuration
   3. Avro Integration (hudi-hadoop-common):
      - Modified HoodieAvroFileWriterFactory.newParquetFileWriter() to support config injection (lines 71-85)
      - Updated the getHoodieAvroWriteSupport() signature to accept a StorageConfiguration
      - Added TestHoodieAvroParquetConfigInjector with similar test coverage
   4. Flink Integration (hudi-flink-client):
      - Modified HoodieRowDataFileWriterFactory.newParquetFileWriter() to support config injection (lines 126-140)
      - Added TestHoodieRowDataParquetConfigInjector with similar test coverage
   5. Configuration:
      - Added the HOODIE_PARQUET_CONFIG_INJECTOR_CLASS config key in HoodieStorageConfig
      - Added a withParquetConfigInjectorClass() builder method
   
   Example Usage:

   ```java
   public class CustomInjector implements HoodieParquetConfigInjector {
     @Override
     public Pair<StorageConfiguration, HoodieConfig> withProps(
         StoragePath path, StorageConfiguration storageConf, HoodieConfig hoodieConfig) {
       // Disable dictionary encoding for high-cardinality partitions
       if (path.toString().contains("high_cardinality")) {
         hoodieConfig.setValue(HoodieStorageConfig.PARQUET_DICTIONARY_ENABLED, "false");
       }
       return Pair.of(storageConf, hoodieConfig);
     }
   }

   // Usage in HoodieWriteConfig
   HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
       .withStorageConfig(HoodieStorageConfig.newBuilder()
           .withParquetConfigInjectorClass("com.example.CustomInjector")
           .build())
       .build();
   ```
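The no-op path mentioned under Impact (a single empty check before any reflection happens) can be illustrated with a small self-contained sketch. Note the types below (`StorageConf`, `WriterConfig`, `ConfigInjector`, `NoDictionaryInjector`, `applyInjector`) are simplified local stand-ins invented for this sketch so it compiles on its own; they are not Hudi's real StorageConfiguration, HoodieConfig, or HoodieParquetConfigInjector, and this is not the PR's actual factory code:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-ins for Hudi's StorageConfiguration / HoodieConfig.
class StorageConf { final Properties props = new Properties(); }
class WriterConfig { final Properties props = new Properties(); }

// Simplified local mirror of the HoodieParquetConfigInjector contract.
interface ConfigInjector {
  Map.Entry<StorageConf, WriterConfig> withProps(String path, StorageConf sc, WriterConfig wc);
}

// Example injector: disable dictionary encoding for high-cardinality paths.
class NoDictionaryInjector implements ConfigInjector {
  @Override
  public Map.Entry<StorageConf, WriterConfig> withProps(String path, StorageConf sc, WriterConfig wc) {
    if (path.contains("high_cardinality")) {
      wc.props.setProperty("hoodie.parquet.dictionary.enabled", "false");
    }
    return new SimpleEntry<>(sc, wc);
  }
}

public class InjectorLoadingSketch {
  // Mirrors the gating pattern: bail out on an empty class name before
  // paying any reflection cost, otherwise instantiate and apply the injector.
  static WriterConfig applyInjector(String injectorClass, String path, StorageConf sc, WriterConfig wc) {
    if (injectorClass == null || injectorClass.isEmpty()) {
      return wc; // default path: no injector configured, config untouched
    }
    try {
      ConfigInjector injector = (ConfigInjector) Class.forName(injectorClass)
          .getDeclaredConstructor().newInstance();
      return injector.withProps(path, sc, wc).getValue();
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Could not instantiate injector: " + injectorClass, e);
    }
  }

  public static void main(String[] args) {
    WriterConfig out = applyInjector(NoDictionaryInjector.class.getName(),
        "/tbl/high_cardinality/part-0.parquet", new StorageConf(), new WriterConfig());
    System.out.println(out.props.getProperty("hoodie.parquet.dictionary.enabled"));
  }
}
```

An invalid class name surfaces as an exception from the reflective load, which is the behavior testInvalidInjectorClassThrowsException() covers in the real implementation.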
   
   ### Impact
   
   Public API:
   - New interface: HoodieParquetConfigInjector (marked with the appropriate API annotations)
   - New configuration: hoodie.parquet.config.injector.class (optional, defaults to an empty string)

   User-facing changes:
   - Users can now customize Parquet settings per file/partition without modifying Hudi code
   - Fully backward compatible: existing code continues to work without changes
   - No performance impact when the feature is not used (a single isNullOrEmpty() check)

   Use cases enabled:
   1. Selective dictionary encoding based on partition characteristics
   2. Native Parquet bloom filter configuration for specific columns
   3. Custom compression strategies per table/partition
   4. Dynamic Parquet settings based on file path patterns
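For use case 2, an injector would typically set parquet-hadoop's per-column bloom filter properties on the storage configuration. The keys below are the standard parquet-hadoop ones (the `user_id` column name and the NDV value are just illustrations):

```properties
# Enable a native Parquet bloom filter only for the user_id column
parquet.bloom.filter.enabled#user_id=true
# Hint the expected number of distinct values so Parquet can size the filter
parquet.bloom.filter.expected.ndv#user_id=500000
```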
   
   ### Risk Level
   
   Low
   
   ### Documentation Update
   
    1. Configuration Documentation: Update the Hudi configuration docs to add hoodie.parquet.config.injector.class (optional, default: ""):
      - Fully qualified class name of a HoodieParquetConfigInjector implementation
      - Allows custom Parquet configuration injection at write time
      - The class must implement the org.apache.hudi.io.HoodieParquetConfigInjector interface
      - Example: "com.example.CustomParquetConfigInjector"
    2. User Guide: Add a section on "Advanced Parquet Configuration" covering:
      - How to implement the HoodieParquetConfigInjector interface
      - Example injector implementations (dictionary encoding, bloom filters)
      - How to configure and use custom injectors
      - Common use cases and patterns
    3. JavaDoc: The interface and its methods are already documented inline
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
