Re: [PR] fix: Fixing partitioned RLI for nested partitions [hudi]

via GitHub Tue, 23 Jun 2026 10:20:15 -0700


hudi-agent commented on code in PR #18378:
URL: https://github.com/apache/hudi/pull/18378#discussion_r3461620663



##########
hudi-common/src/main/java/org/apache/hudi/common/util/PartitionPathEncodeUtils.java:
##########
@@ -56,6 +56,7 @@ public class PartitionPathEncodeUtils {
 
     charToEscapeFilename.set('_');
     charToEscapeFilename.set('-');
+    charToEscapeFilename.set('.');

Review Comment:
   🤖 I want to make sure the backwards-compat scope here is fully covered. 
`charToEscape` feeds `PartitionPathEncodeUtils.escapePathName`, which is also 
used in the keygen write path (`StringPartitionPathFormatter.encode` -> 
`PartitionPathFormatterBase.combine`) to encode partition *values* whenever 
`URL_ENCODE_PARTITIONING` is on, not just MDT/RLI. For an existing url-encoded 
table with a dot in a partition value (e.g. `version=1.2.3`), older data was 
written to `version=1.2.3` while new writes would now go to `version=1%2E2%2E3` 
— splitting the same logical partition across two physical dirs. Could you 
confirm that case is handled (or gate the dot-escaping to the MDT path)?



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -711,6 +711,74 @@ class TestRecordLevelIndex extends 
RecordLevelIndexTestBase with SparkDatasetMix
       s"Failed to create empty replacement file $candidateBaseFile")
     candidateBaseFile.getName
   }
+
+  @Test
+  def 
testPartitionedRecordLevelIndexWithHiveStylePartitioningAndDotInPartitionField():
 Unit = {
+    initMetaClient(HoodieTableType.COPY_ON_WRITE)
+    val dataGen = new HoodieTestDataGenerator()
+    val inserts = dataGen.generateInserts("001", 10)
+    val insertDf = toDataset(spark, inserts)
+
+    // Use fare.currency as partition field to test dots in partition field 
names with Hive-style partitioning
+    val options = Map(HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      DataSourceWriteOptions.TABLE_TYPE.key -> 
HoodieTableType.COPY_ON_WRITE.name(),
+      RECORDKEY_FIELD.key -> "_row_key",
+      PARTITIONPATH_FIELD.key -> "fare.currency",
+      HoodieTableConfig.ORDERING_FIELDS.key -> "timestamp",
+      HoodieMetadataConfig.GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> 
"false",
+      HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "true",
+      HoodieMetadataConfig.STREAMING_WRITE_ENABLED.key() -> "false",
+      HoodieCompactionConfig.INLINE_COMPACT.key() -> "false",
+      HoodieIndexConfig.INDEX_TYPE.key() -> RECORD_LEVEL_INDEX.name(),
+      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true")
+
+    insertDf.write.format("hudi")
+      .options(options)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    assertEquals(10, spark.read.format("hudi").load(basePath).count())
+
+    val props = 
TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(options).asJava)
+    val writeConfig = HoodieWriteConfig.newBuilder()
+      .withProps(props)
+      .withPath(basePath)
+      .build()
+
+    val metadata = metadataWriter(writeConfig).getTableMetadata
+    val recordKeys = inserts.asScala.map(i => 
i.getRecordKey).asJava.stream().collect(Collectors.toList())

Review Comment:
   🤖 nit: could you simplify this to 
`inserts.asScala.map(_.getRecordKey).toList.asJava`? The round-trip through 
`.asJava.stream().collect(Collectors.toList())` and the explicit lambda are 
both a bit more Java-esque than typical Scala here.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -711,6 +711,74 @@ class TestRecordLevelIndex extends 
RecordLevelIndexTestBase with SparkDatasetMix
       s"Failed to create empty replacement file $candidateBaseFile")
     candidateBaseFile.getName
   }
+
+  @Test
+  def 
testPartitionedRecordLevelIndexWithHiveStylePartitioningAndDotInPartitionField():
 Unit = {
+    initMetaClient(HoodieTableType.COPY_ON_WRITE)
+    val dataGen = new HoodieTestDataGenerator()
+    val inserts = dataGen.generateInserts("001", 10)
+    val insertDf = toDataset(spark, inserts)
+
+    // Use fare.currency as partition field to test dots in partition field 
names with Hive-style partitioning
+    val options = Map(HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+      DataSourceWriteOptions.TABLE_TYPE.key -> 
HoodieTableType.COPY_ON_WRITE.name(),
+      RECORDKEY_FIELD.key -> "_row_key",
+      PARTITIONPATH_FIELD.key -> "fare.currency",
+      HoodieTableConfig.ORDERING_FIELDS.key -> "timestamp",
+      HoodieMetadataConfig.GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> 
"false",
+      HoodieMetadataConfig.RECORD_LEVEL_INDEX_ENABLE_PROP.key() -> "true",
+      HoodieMetadataConfig.STREAMING_WRITE_ENABLED.key() -> "false",
+      HoodieCompactionConfig.INLINE_COMPACT.key() -> "false",
+      HoodieIndexConfig.INDEX_TYPE.key() -> RECORD_LEVEL_INDEX.name(),
+      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true")
+
+    insertDf.write.format("hudi")
+      .options(options)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    assertEquals(10, spark.read.format("hudi").load(basePath).count())
+
+    val props = 
TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(options).asJava)
+    val writeConfig = HoodieWriteConfig.newBuilder()
+      .withProps(props)
+      .withPath(basePath)
+      .build()
+
+    val metadata = metadataWriter(writeConfig).getTableMetadata
+    val recordKeys = inserts.asScala.map(i => 
i.getRecordKey).asJava.stream().collect(Collectors.toList())
+
+    // Verify record index entries for the USD partition. 
HoodieTestDataGenerator's
+    // FARE_NESTED_SCHEMA populator hard-codes currency="USD", so all 10 
records land
+    // in fare.currency=USD and the assertion below is deterministic.
+    val usdPartitionLocations = readRecordIndex(metadata, recordKeys, 
HOption.of("fare.currency=USD"))
+
+    // All records should be found

Review Comment:
   🤖 nit: `// All records should be found` is restating what `assertEquals(10, 
...)` already expresses — it might be worth dropping it.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix: Fixing partitioned RLI for nested partitions [hudi]

Reply via email to