yihua commented on code in PR #18379:
URL: https://github.com/apache/hudi/pull/18379#discussion_r3036297245


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -2400,6 +2400,10 @@ public boolean parquetBloomFilterEnabled() {
     return 
getBooleanOrDefault(HoodieStorageConfig.PARQUET_WITH_BLOOM_FILTER_ENABLED);
   }
 
+  public Option<String> getParquetConfigInjectorClass() {
+    return 
Option.ofNullable(getString(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS));
+  }

Review Comment:
   <a href="#"><img alt="P2" 
src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7"; 
align="top"></a> **Unused accessor method**
   
   `getParquetConfigInjectorClass()` returns `Option<String>` but is never 
called anywhere in the codebase. All three writer factories 
(`HoodieAvroFileWriterFactory`, `HoodieRowDataFileWriterFactory`, 
`HoodieSparkFileWriterFactory`) read the config directly via 
`config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
 StringUtils.EMPTY_STRING)`. The method should either be used consistently by 
the factories or removed to avoid confusion.
   
   — *Greptile* 
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036292446)) 
(source:comment#3036292446)



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##########
@@ -57,16 +62,32 @@ protected HoodieFileWriter newParquetFileWriter(
     if (compressionCodecName.isEmpty()) {
       compressionCodecName = null;
     }
-    HoodieRowParquetWriteSupport writeSupport = 
getHoodieRowParquetWriteSupport(storage.getConf(), schema,
-        config, enableBloomFilter(populateMetaFields, config));
+
+    String configInjectorClass = 
config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
 StringUtils.EMPTY_STRING);
+
+    StorageConfiguration storageConfiguration = storage.getConf();
+    HoodieConfig hoodieConfig = config;
+    if (!StringUtils.isNullOrEmpty(configInjectorClass)) {
+      try {
+        HoodieParquetConfigInjector injector = (HoodieParquetConfigInjector) 
ReflectionUtils.loadClass(configInjectorClass);
+        Pair<StorageConfiguration, HoodieConfig> modifiedConfigs = 
injector.withProps(path, storageConfiguration, hoodieConfig);
+        storageConfiguration = modifiedConfigs.getLeft();
+        hoodieConfig = modifiedConfigs.getRight();

Review Comment:
   _⚠️ Potential issue_ | _🟠 Major_
   
   **Don't pass the live config objects into the injector.**
   
   `storageConfiguration` and `hoodieConfig` are just aliases for 
`storage.getConf()` and the caller-supplied `config`. The new 
`DisableDictionaryInjector` test mutates `hoodieConfig` in place, so creating 
one writer can silently rewrite the caller-owned config as a side effect, and 
any later writer that reuses it will inherit the previous file's overrides. 
Invoke the injector on defensive copies and only use the returned instances for 
this writer. The same aliasing pattern is repeated in 
`HoodieRowDataFileWriterFactory` and `HoodieAvroFileWriterFactory`.
   
   <details>
   <summary>🤖 Prompt for AI Agents</summary>
   
   ```
   Verify each finding against the current code and only fix it if needed.
   
   In
   
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java`
   around lines 68 - 75, The code currently passes live aliases
   storageConfiguration (from storage.getConf()) and hoodieConfig (from config)
   into the injector, allowing in-place mutation of caller-owned objects; change
   HoodieSparkFileWriterFactory so that before calling
   ReflectionUtils.loadClass(configInjectorClass) and invoking
   HoodieParquetConfigInjector.withProps(path, ...), you pass defensive copies
   (e.g., clone or new instances built from storage.getConf() and config) rather
   than the original storageConfiguration and hoodieConfig, then assign only the
   returned Pair's left/right to the writer-local variables; apply the same
   defensive-copy fix to the identical aliasing in 
HoodieRowDataFileWriterFactory
   and HoodieAvroFileWriterFactory so the injector cannot mutate caller state.
   ```
   
   </details>
   
   <!-- 
fingerprinting:phantom:medusa:grasshopper:49991f1a-3102-43c3-86ca-eab20a52597e 
-->
   
   <!-- This is an auto-generated comment by CodeRabbit -->
   
   — *CodeRabbit* 
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036297064)) 
(source:comment#3036297064)



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieAvroFileWriterFactory.java:
##########
@@ -62,26 +67,43 @@ protected HoodieFileWriter newParquetFileWriter(
       String instantTime, StoragePath path, HoodieConfig config, HoodieSchema 
schema,
       TaskContextSupplier taskContextSupplier) throws IOException {
     boolean populateMetaFields = 
config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
-    HoodieAvroWriteSupport writeSupport = getHoodieAvroWriteSupport(schema, 
config, enableBloomFilter(populateMetaFields, config));
 
-    String compressionCodecName = 
config.getStringOrDefault(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME);
+    // Support for custom Parquet config injector
+    String configInjectorClass = 
config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_CONFIG_INJECTOR_CLASS,
 StringUtils.EMPTY_STRING);
+
+    StorageConfiguration storageConfiguration = storage.getConf();
+    HoodieConfig hoodieConfig = config;
+    if (!StringUtils.isNullOrEmpty(configInjectorClass)) {
+      try {
+        HoodieParquetConfigInjector injector = (HoodieParquetConfigInjector) 
ReflectionUtils.loadClass(configInjectorClass);
+        Pair<StorageConfiguration, HoodieConfig> modifiedConfigs = 
injector.withProps(path, storageConfiguration, hoodieConfig);
+        storageConfiguration = modifiedConfigs.getLeft();
+        hoodieConfig = modifiedConfigs.getRight();
+      } catch (Exception e) {
+        throw new HoodieException("Failed to instantiate or invoke parquet 
config injector class: " + configInjectorClass, e);
+      }
+    }

Review Comment:
   <a href="#"><img alt="P1" 
src="https://greptile-static-assets.s3.amazonaws.com/badges/p1.svg?v=7"; 
align="top"></a> **Injector receives—and can mutate—the original shared config**
   
   `HoodieConfig hoodieConfig = config;` is a reference assignment, not a copy. 
When the injector calls `hoodieConfig.setValue(...)` (as all three 
test-supplied implementations do, e.g. `DisableDictionaryInjector`), it mutates 
the exact same `Properties` object that backs `config`. Because 
`HoodieWriteConfig` is typically shared across all file-write handles within a 
single write operation, this means:
   
   1. The first file written by the injector permanently alters the shared 
config seen by every subsequent file in the same operation.
   2. Path-specific overrides (the interface signature accepts `StoragePath` 
suggesting this intent) will "leak" to unintended files.
   
   The same problem exists in `HoodieRowDataFileWriterFactory` (line 130) and 
`HoodieSparkFileWriterFactory` (line 69).
   
   A defensive copy should be made before handing the config to user-supplied 
code:
   
   ```java
   HoodieConfig hoodieConfig = new HoodieConfig(config.getProps());
   ```
   
   The same fix is needed in all three factory classes.
   
   — *Greptile* 
([original](https://github.com/yihua/hudi/pull/13#discussion_r3036292437)) 
(source:comment#3036292437)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to