nsivabalan opened a new pull request, #18379:
URL: https://github.com/apache/hudi/pull/18379
### Describe the issue this Pull Request addresses
This PR adds support for custom Parquet configuration injection across all
file writer factories in Apache Hudi. This feature allows users to inject
custom Parquet configurations (e.g., native Parquet bloom filters, custom
compression settings, dictionary encoding overrides) at runtime without
modifying Hudi's core code.
Motivation: Users sometimes need to apply specific Parquet configurations
for certain tables or partitions (e.g., disable dictionary encoding for
high-cardinality columns, enable native Parquet bloom filters for specific
columns, or apply custom encoding strategies). Previously, these configurations
were hard-coded or required code changes. This PR introduces a pluggable
mechanism via the HoodieParquetConfigInjector interface.
### Summary and Changelog
Summary: Added support for custom Parquet configuration injection across
Spark, Avro, and Flink file writers. Users can now implement
the HoodieParquetConfigInjector interface and specify it via the
hoodie.parquet.config.injector.class configuration to inject custom
Parquet settings at write time.
Changes:
1. Core Implementation (hudi-client-common):
   - Added the HoodieParquetConfigInjector interface with a withProps() method that accepts a StoragePath, StorageConfiguration, and HoodieConfig and returns the modified configurations
2. Spark Integration (hudi-spark-client):
   - Modified HoodieSparkFileWriterFactory.newParquetFileWriter() to check for and invoke the config injector (lines 66-79)
   - Added comprehensive tests in TestHoodieParquetConfigInjector:
     - testDisableDictionaryEncodingViaInjector() - validates that dictionary encoding can be disabled
     - testInvalidInjectorClassThrowsException() - validates error handling
     - testNoInjectorUsesDefaultConfig() - validates backward compatibility
   - Tests validate the actual Parquet metadata (encodings) rather than just the configuration
3. Avro Integration (hudi-hadoop-common):
   - Modified HoodieAvroFileWriterFactory.newParquetFileWriter() to support config injection (lines 71-85)
   - Updated the getHoodieAvroWriteSupport() signature to accept a StorageConfiguration
   - Added TestHoodieAvroParquetConfigInjector with similar test coverage
4. Flink Integration (hudi-flink-client):
   - Modified HoodieRowDataFileWriterFactory.newParquetFileWriter() to support config injection (lines 126-140)
   - Added TestHoodieRowDataParquetConfigInjector with similar test coverage
5. Configuration:
   - Added the HOODIE_PARQUET_CONFIG_INJECTOR_CLASS config key in HoodieStorageConfig
   - Added a withParquetConfigInjectorClass() builder method
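The check-and-invoke pattern described for the three factories presumably reads the injector class name from the config and, if set, instantiates it via reflection before the Parquet writer is built. The sketch below is illustrative only, not the actual Hudi code: it uses simplified stand-in types (WriterConfig, ConfigInjector, ParquetWriterFactory are hypothetical names), and shows why the disabled path costs only a single empty-string check.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in config holder (hypothetical; the real factories pass Hudi's
// StorageConfiguration/HoodieConfig types).
class WriterConfig {
  final Map<String, String> props = new HashMap<>();

  String getStringOrDefault(String key, String dflt) {
    return props.getOrDefault(key, dflt);
  }

  void setValue(String key, String value) {
    props.put(key, value);
  }
}

// Stand-in for the injector extension point: a single method that may
// rewrite the configuration for a given target file path.
interface ConfigInjector {
  WriterConfig withProps(String path, WriterConfig config);
}

class ParquetWriterFactory {
  static final String INJECTOR_CLASS_KEY = "hoodie.parquet.config.injector.class";

  // Sketch of the check-and-invoke step: when no injector is configured,
  // this is a single empty-string check and the config passes through.
  static WriterConfig applyInjector(String path, WriterConfig config) {
    String injectorClass = config.getStringOrDefault(INJECTOR_CLASS_KEY, "");
    if (injectorClass.isEmpty()) {
      return config; // default behavior, fully backward compatible
    }
    try {
      ConfigInjector injector = (ConfigInjector)
          Class.forName(injectorClass).getDeclaredConstructor().newInstance();
      return injector.withProps(path, config);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Could not instantiate injector " + injectorClass, e);
    }
  }
}

// Example injector that the reflection path above can load by name.
class DisableDictionaryInjector implements ConfigInjector {
  @Override
  public WriterConfig withProps(String path, WriterConfig config) {
    config.setValue("hoodie.parquet.dictionary.enabled", "false");
    return config;
  }
}
```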
Example Usage:

```java
public class CustomInjector implements HoodieParquetConfigInjector {
  @Override
  public Pair<StorageConfiguration, HoodieConfig> withProps(
      StoragePath path, StorageConfiguration storageConf, HoodieConfig hoodieConfig) {
    // Disable dictionary encoding for high-cardinality partitions
    if (path.toString().contains("high_cardinality")) {
      hoodieConfig.setValue(HoodieStorageConfig.PARQUET_DICTIONARY_ENABLED, "false");
    }
    return Pair.of(storageConf, hoodieConfig);
  }
}
```

```java
// Usage in HoodieWriteConfig
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withStorageConfig(HoodieStorageConfig.newBuilder()
        .withParquetConfigInjectorClass("com.example.CustomInjector")
        .build())
    .build();
```
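To exercise the example outside a Hudi build, here is a hedged, self-contained sketch. The Hudi types (StoragePath, StorageConfiguration, HoodieConfig, Pair) are replaced with minimal stand-ins so the sketch compiles on its own; the real classes live in the Hudi codebase and may differ, and the plain string key stands in for HoodieStorageConfig.PARQUET_DICTIONARY_ENABLED (the actual key name may differ).

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the Hudi types used in the example (illustrative only).
class StoragePath {
  private final String path;
  StoragePath(String path) { this.path = path; }
  @Override public String toString() { return path; }
}

class StorageConfiguration {
  final Map<String, String> props = new HashMap<>();
}

class HoodieConfig {
  final Map<String, String> props = new HashMap<>();
  void setValue(String key, String value) { props.put(key, value); }
  String getString(String key) { return props.get(key); }
}

class Pair<L, R> {
  final L left;
  final R right;
  private Pair(L left, R right) { this.left = left; this.right = right; }
  static <L, R> Pair<L, R> of(L left, R right) { return new Pair<>(left, right); }
}

// The extension point as described in this PR: inspect the target file path
// and return possibly modified configurations.
interface HoodieParquetConfigInjector {
  Pair<StorageConfiguration, HoodieConfig> withProps(
      StoragePath path, StorageConfiguration storageConf, HoodieConfig hoodieConfig);
}

// The injector from the example above, using a plain string key in place of
// HoodieStorageConfig.PARQUET_DICTIONARY_ENABLED.
class CustomInjector implements HoodieParquetConfigInjector {
  @Override
  public Pair<StorageConfiguration, HoodieConfig> withProps(
      StoragePath path, StorageConfiguration storageConf, HoodieConfig hoodieConfig) {
    // Disable dictionary encoding for high-cardinality partitions
    if (path.toString().contains("high_cardinality")) {
      hoodieConfig.setValue("hoodie.parquet.dictionary.enabled", "false");
    }
    return Pair.of(storageConf, hoodieConfig);
  }
}
```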
### Impact
Public API:
- New interface: HoodieParquetConfigInjector (marked with appropriate
annotations)
- New configuration: hoodie.parquet.config.injector.class (optional,
defaults to empty string)
User-facing changes:
- Users can now customize Parquet settings per file/partition without
modifying Hudi code
- Fully backward compatible: existing code continues to work without changes
- No performance impact when the feature is not used (a single isNullOrEmpty() check)
Use cases enabled:
1. Selective dictionary encoding based on partition characteristics
2. Native Parquet bloom filter configuration for specific columns
3. Custom compression strategies per table/partition
4. Dynamic Parquet settings based on file path patterns
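As an illustration of use case 2, an injector could enable parquet-hadoop's native bloom filters on selected columns. The property name parquet.bloom.filter.enabled with a `#<column>` suffix is parquet-hadoop's per-column switch as best I recall; verify it against your Parquet version. The config type and column names below are simplified stand-ins, not Hudi's actual StorageConfiguration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hudi's StorageConfiguration (illustrative only).
class StorageConfiguration {
  final Map<String, String> props = new HashMap<>();
  void set(String key, String value) { props.put(key, value); }
  String get(String key) { return props.get(key); }
}

// Hypothetical injector enabling native Parquet bloom filters on chosen
// columns via "parquet.bloom.filter.enabled#<column>" (parquet-hadoop's
// per-column property; an assumption here -- check your Parquet release).
class BloomFilterInjector {
  private static final String[] BLOOM_COLUMNS = {"user_id", "session_id"};

  StorageConfiguration withProps(String path, StorageConfiguration conf) {
    for (String column : BLOOM_COLUMNS) {
      conf.set("parquet.bloom.filter.enabled#" + column, "true");
    }
    return conf;
  }
}
```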
### Risk Level
Low
### Documentation Update
1. Configuration Documentation: Update Hudi configuration docs to add:
hoodie.parquet.config.injector.class (optional, default: "")
- Fully qualified class name of HoodieParquetConfigInjector implementation
- Allows custom Parquet configuration injection at write time
- Class must implement org.apache.hudi.io.HoodieParquetConfigInjector
interface
- Example: "com.example.CustomParquetConfigInjector"
2. User Guide: Add section on "Advanced Parquet Configuration" showing:
- How to implement HoodieParquetConfigInjector interface
- Example injector implementations (dictionary encoding, bloom filters)
- How to configure and use custom injectors
- Common use cases and patterns
3. JavaDoc: Interface and methods are already documented inline
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
--
This is an automated message from the Apache Git Service.