yihua commented on code in PR #18379:
URL: https://github.com/apache/hudi/pull/18379#discussion_r3036297245
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -2400,6 +2400,10 @@ public boolean parquetBloomFilterEnabled() {
return
getBooleanOrDefault(HoodieStorageConfig.PARQUET_WITH_BLOOM_FILTER_ENABLED);
}
+ public Option<String> getParquetConfigInjectorClass() {
+ return
Option.ofNullable(getString(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS));
+ }
Review Comment:
<a href="#"><img alt="P2"
src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7"
align="top"></a> **Unused accessor method**
`getParquetConfigInjectorClass()` returns `Option<String>` but is never
called anywhere in the codebase. All three writer factories
(`HoodieAvroFileWriterFactory`, `HoodieRowDataFileWriterFactory`,
`HoodieSparkFileWriterFactory`) read the config directly via
`config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
StringUtils.EMPTY_STRING)`. The method should either be used consistently by
the factories or removed to avoid confusion.
— *Greptile*
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036292446))
(source:comment#3036292446)
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##########
@@ -57,16 +62,32 @@ protected HoodieFileWriter newParquetFileWriter(
if (compressionCodecName.isEmpty()) {
compressionCodecName = null;
}
- HoodieRowParquetWriteSupport writeSupport =
getHoodieRowParquetWriteSupport(storage.getConf(), schema,
- config, enableBloomFilter(populateMetaFields, config));
+
+ String configInjectorClass =
config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
StringUtils.EMPTY_STRING);
+
+ StorageConfiguration storageConfiguration = storage.getConf();
+ HoodieConfig hoodieConfig = config;
+ if (!StringUtils.isNullOrEmpty(configInjectorClass)) {
+ try {
+ HoodieParquetConfigInjector injector = (HoodieParquetConfigInjector)
ReflectionUtils.loadClass(configInjectorClass);
+ Pair<StorageConfiguration, HoodieConfig> modifiedConfigs =
injector.withProps(path, storageConfiguration, hoodieConfig);
+ storageConfiguration = modifiedConfigs.getLeft();
+ hoodieConfig = modifiedConfigs.getRight();
Review Comment:
_⚠️ Potential issue_ | _🟠 Major_
**Don't pass the live config objects into the injector.**
`storageConfiguration` and `hoodieConfig` are just aliases for
`storage.getConf()` and the caller-supplied `config`. The new
`DisableDictionaryInjector` test mutates `hoodieConfig` in place, so creating
one writer can silently rewrite the caller-owned config as a side effect, and
any later writer that reuses it will inherit the previous file's overrides.
Invoke the injector on defensive copies and only use the returned instances for
this writer. The same aliasing pattern is repeated in
`HoodieRowDataFileWriterFactory` and `HoodieAvroFileWriterFactory`.
<details>
<summary>🤖 Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java`
around lines 68 - 75, The code currently passes live aliases
storageConfiguration (from storage.getConf()) and hoodieConfig (from config)
into the injector, allowing in-place mutation of caller-owned objects; change
HoodieSparkFileWriterFactory so that before calling
ReflectionUtils.loadClass(configInjectorClass) and invoking
HoodieParquetConfigInjector.withProps(path, ...), you pass defensive copies
(e.g., clone or new instances built from storage.getConf() and config) rather
than the original storageConfiguration and hoodieConfig, then assign only the
returned Pair's left/right to the writer-local variables; apply the same
defensive-copy fix to the identical aliasing in
HoodieRowDataFileWriterFactory
and HoodieAvroFileWriterFactory so the injector cannot mutate caller state.
```
</details>
<!--
fingerprinting:phantom:medusa:grasshopper:49991f1a-3102-43c3-86ca-eab20a52597e
-->
<!-- This is an auto-generated comment by CodeRabbit -->
— *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036297064))
(source:comment#3036297064)
##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieAvroFileWriterFactory.java:
##########
@@ -62,26 +67,43 @@ protected HoodieFileWriter newParquetFileWriter(
String instantTime, StoragePath path, HoodieConfig config, HoodieSchema
schema,
TaskContextSupplier taskContextSupplier) throws IOException {
boolean populateMetaFields =
config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
- HoodieAvroWriteSupport writeSupport = getHoodieAvroWriteSupport(schema,
config, enableBloomFilter(populateMetaFields, config));
- String compressionCodecName =
config.getStringOrDefault(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME);
+ // Support for custom Parquet config injector
+ String configInjectorClass =
config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
StringUtils.EMPTY_STRING);
+
+ StorageConfiguration storageConfiguration = storage.getConf();
+ HoodieConfig hoodieConfig = config;
+ if (!StringUtils.isNullOrEmpty(configInjectorClass)) {
+ try {
+ HoodieParquetConfigInjector injector = (HoodieParquetConfigInjector)
ReflectionUtils.loadClass(configInjectorClass);
+ Pair<StorageConfiguration, HoodieConfig> modifiedConfigs =
injector.withProps(path, storageConfiguration, hoodieConfig);
+ storageConfiguration = modifiedConfigs.getLeft();
+ hoodieConfig = modifiedConfigs.getRight();
+ } catch (Exception e) {
+ throw new HoodieException("Failed to instantiate or invoke parquet
config injector class: " + configInjectorClass, e);
+ }
+ }
Review Comment:
<a href="#"><img alt="P1"
src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7"
align="top"></a> **Injector receives—and can mutate—the original shared config**
`HoodieConfig hoodieConfig = config;` is a reference assignment, not a copy.
When the injector calls `hoodieConfig.setValue(...)` (as all three
test-supplied implementations do, e.g. `DisableDictionaryInjector`), it mutates
the exact same `Properties` object that backs `config`. Because
`HoodieWriteConfig` is typically shared across all file-write handles within a
single write operation, this means:
1. The first file written by the injector permanently alters the shared
config seen by every subsequent file in the same operation.
2. Path-specific overrides (the interface signature accepts `StoragePath`
suggesting this intent) will "leak" to unintended files.
The same problem exists in `HoodieRowDataFileWriterFactory` (line 130) and
`HoodieSparkFileWriterFactory` (line 69).
A defensive copy should be made before handing the config to user-supplied
code:
```java
HoodieConfig hoodieConfig = new HoodieConfig(config.getProps());
```
The same fix is needed in all three factory classes.
— *Greptile*
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036292437))
(source:comment#3036292437)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]