nsivabalan commented on code in PR #17850:
URL: https://github.com/apache/hudi/pull/17850#discussion_r2799405795


##########
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##########
@@ -127,19 +125,17 @@ public class HoodieSyncConfig extends HoodieConfig {
           partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
              .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
         }
+
         if (!partitionFieldsOpt.isPresent()) {
-          return Option.empty();
+          return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");

Review Comment:
   Oh, I think it's better to have the infer function only on the table config; the infer function here should just take the value from tableConfig. That would simplify things.
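   
   For example, a minimal sketch of what I mean (illustrative only: the exact getter call is an assumption, and `PARTITION_VALUE_EXTRACTOR_CLASS` refers to the new table config added in this PR):
   
   ```java
   // Sketch: the sync config carries no inference logic of its own and simply
   // mirrors whatever the table config resolved via its own infer function.
   public static final ConfigProperty<String> META_SYNC_PARTITION_EXTRACTOR_CLASS = ConfigProperty
       .key("hoodie.datasource.hive_sync.partition_extractor_class")
       .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
       .withInferFunction(cfg ->
           Option.ofNullable(cfg.getString(HoodieTableConfig.PARTITION_VALUE_EXTRACTOR_CLASS)));
   ```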



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -657,6 +658,13 @@ public class FlinkOptions extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  @AdvancedConfig
+  public static final ConfigOption<String> PARTITION_VALUE_EXTRACTOR = ConfigOptions
+      .key(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key())

Review Comment:
   Why do we need this? HoodieTableConfig is common across engines, right?
   



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -335,6 +335,37 @@ public static final String getDefaultPayloadClassName() {
       .sinceVersion("1.0.0")
      .withDocumentation("Key Generator type to determine key generator class");
 
+  public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
+      .key("hoodie.datasource.hive_sync.partition_extractor_class")
+      .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor")
+      .withInferFunction(cfg -> {

Review Comment:
   Let's go with `hoodie.table.partition_value_extractor_class`.
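   
   Something like this (just a sketch with the renamed key; the infer function and documentation would stay as in the diff above):
   
   ```java
   // Proposed table-level property name; behavior otherwise unchanged.
   public static final ConfigProperty<String> PARTITION_VALUE_EXTRACTOR_CLASS = ConfigProperty
       .key("hoodie.table.partition_value_extractor_class")
       .defaultValue("org.apache.hudi.hive.MultiPartKeysValueExtractor");
   ```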



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -527,9 +527,17 @@ object HoodieFileIndex extends Logging {
      properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key, listingModeOverride)
     }
 
+    val usePartitionValueExtractorOnRead = getConfigValue(options, sqlConf,
+      DataSourceReadOptions.USE_PARTITION_VALUE_EXTRACTOR_ON_READ.key, null)

Review Comment:
   Why null as the default? We can use `DataSourceReadOptions.USE_PARTITION_VALUE_EXTRACTOR_ON_READ.defaultValue()`, and the `if` condition on line 532 can then just check
   
   ```
   if (usePartitionValueExtractorOnRead) {
   
   }
   ```
   instead of checking for null



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -477,6 +477,12 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
    getPartitionColumnsAsInternalRowInternal(file,
      metaClient.getBasePath, extractPartitionValuesFromPartitionPath = true)

+  protected def usePartitionValueExtractorOnRead(optParams: Map[String, String], sparkSession: SparkSession): Boolean = {
+    optParams.getOrElse(DataSourceReadOptions.USE_PARTITION_VALUE_EXTRACTOR_ON_READ.key,
+      DataSourceReadOptions.USE_PARTITION_VALUE_EXTRACTOR_ON_READ.defaultValue).toBoolean ||
+      ProvidesHoodieConfig.isSchemaEvolutionEnabled(sparkSession)

Review Comment:
   What's this schema evolution enabled condition? Can you shed some light on it?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala:
##########
@@ -323,6 +323,16 @@ object HoodieWriterUtils {
          && currentPartitionFields != tableConfigPartitionFields) {
          diffConfigs.append(s"PartitionPath:\t$currentPartitionFields\t$tableConfigPartitionFields\n")
         }
+
+        // Validate partition value extractor
+        val currentPartitionValueExtractor = params.getOrElse(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS.key(), null)

Review Comment:
   Yes. Take a look at `hoodie.table.partition.fields` and `hoodie.datasource.write.partitionpath.field`: one of them is a table property and the other is a writer property.
   
   Users can only configure the writer property; they can't set the table property directly.
   
   Hudi programmatically maps one to the other.
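   
   To illustrate the idea, a minimal self-contained sketch (not Hudi's actual mapping code; the class name and the use of plain `Properties` are mine):
   
   ```java
   import java.util.Properties;
   
   // Hypothetical illustration of how a writer-facing option is carried over to the
   // corresponding table property when the table config is created.
   public final class WriterToTableConfigMapping {
   
     private WriterToTableConfigMapping() {
     }
   
     // Copies the writer property onto the table property if the user supplied it.
     public static void mapPartitionFields(Properties writerProps, Properties tableProps) {
       String writerValue = writerProps.getProperty("hoodie.datasource.write.partitionpath.field");
       if (writerValue != null && !writerValue.isEmpty()) {
         tableProps.setProperty("hoodie.table.partition.fields", writerValue);
       }
     }
   }
   ```
   
   The same pattern would apply to the writer-facing extractor option and the proposed table-level extractor property.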



##########
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncConfig.java:
##########
@@ -127,19 +125,17 @@ public class HoodieSyncConfig extends HoodieConfig {
           partitionFieldsOpt = HoodieTableConfig.getPartitionFieldProp(cfg)
              .or(() -> Option.ofNullable(cfg.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME)));
         }
+
         if (!partitionFieldsOpt.isPresent()) {
-          return Option.empty();
+          return Option.of("org.apache.hudi.hive.NonPartitionedExtractor");

Review Comment:
   I see it's pretty much the same infer function we use here and for the new table config we are adding.
   
   Let's move the infer function to a common place and use it in both places.
   
   It's OK to fix the bug (the non-partitioned case) if that's what this is.
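   
   For instance, something along these lines (a self-contained sketch; the class and method names are placeholders, and the fallback values just mirror the diff above):
   
   ```java
   // Hypothetical shared helper that both HoodieSyncConfig and the new
   // HoodieTableConfig property could delegate their infer functions to.
   public final class PartitionExtractorInference {
   
     private PartitionExtractorInference() {
     }
   
     // No partition fields resolved -> NonPartitionedExtractor (the bug fix);
     // otherwise fall back to the multi-part key extractor default.
     public static String inferExtractorClass(String partitionFields) {
       if (partitionFields == null || partitionFields.trim().isEmpty()) {
         return "org.apache.hudi.hive.NonPartitionedExtractor";
       }
       return "org.apache.hudi.hive.MultiPartKeysValueExtractor";
     }
   }
   ```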



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
