[GitHub] [hudi] nsivabalan commented on a diff in pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen

via GitHub Mon, 24 Jul 2023 10:33:04 -0700


nsivabalan commented on code in PR #9273:
URL: https://github.com/apache/hudi/pull/9273#discussion_r1272554388



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -115,14 +112,16 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
       // Note that key generator class name could be null
       val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
       if 
(classOf[TimestampBasedKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
-        || 
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
 {
+        || 
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+        || 
classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+        || 
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
 {

Review Comment:
   we might need to consider one more thing here. 
   looks like for timestamp based partition field, we are harding the data type 
to string. 
   
   so, within custom key gen, there could be more than 1 field. and so we need 
to intercept the type of each field and only hard code the data type to string 
if its of timestamp type. 
   



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -342,6 +342,20 @@ object HoodieFileIndex extends Logging {
     if (listingModeOverride != null) {
       
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
 listingModeOverride)
     }
+    val tableConfig = metaClient.getTableConfig
+    val partitionColumns = tableConfig.getPartitionFields
+    if (partitionColumns.isPresent) {
+      val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
+      // NOTE: A custom key generator with multiple fields could have 
non-encoded slashes in the partition columns'
+      //       value. We might not be able to properly parse partition-values 
from the listed partition-paths. Fallback
+      //       to eager listing in this case.
+      val isCustomKeyGenerator = 
(classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+        || 
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
+      val hasMultiplePartitionFields = partitionColumns.get().length > 1
+      if (hasMultiplePartitionFields && isCustomKeyGenerator) {
+        
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
 DataSourceReadOptions.FILE_INDEX_LISTING_MODE_EAGER)
+      }
+    }

Review Comment:
   should we also consider complexKeyGen w/ multiple fields. 
   so all that matters here is, if there are more than 1 partitionColumns. we 
don't really need to check for key gen class. 
   



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -115,14 +112,16 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
       // Note that key generator class name could be null
       val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
       if 
(classOf[TimestampBasedKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
-        || 
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
 {
+        || 
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+        || 
classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+        || 
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
 {

Review Comment:
   we can take it as a follow up as well. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] nsivabalan commented on a diff in pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen

Reply via email to