[GitHub] [hudi] zhedoubushishi commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

GitBox Thu, 21 Jul 2022 15:09:59 -0700


zhedoubushishi commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r927140605



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {
+          val isBootstrapTable = BootstrapIndex.getBootstrapIndex(metaClient).useIndex()
+          if (isBootstrapTable) {
+            // For bootstrapped tables its possible the schema does not contain partition field when source table

Review Comment:
   Hi @nsivabalan.
   In this case, let's say the source table is a Hive-style partitioned Parquet 
table (the partition column is not included in the Parquet files), and 
bootstrapping generates a partitioned Hudi table from it. With this change, 
reading that Hudi table treats it as a non-partitioned table, because the 
partition column is not included in the data files.
   
   Yes, in the long term we should be able to infer the partition column and 
its schema type for bootstrapped tables, but that is a more complex issue than 
we can resolve at this time.
   
   We identified that the partition validation logic mainly exists to allow 
partition pruning in HoodieFileIndex.
   
   Rather than breaking the bootstrap feature entirely, we have decided to skip 
this validation for bootstrapped tables and to treat queries against them as 
queries on non-partitioned tables. The impact is that such queries will not 
benefit from partition pruning through Hudi.
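   
   To make the intended behaviour concrete, here is a minimal, Spark-free sketch 
of the fallback. It is not the code in this PR: `Field`, 
`PartitionSchemaSketch`, and `resolvePartitionSchema` are hypothetical names, 
and both the empty-schema result for bootstrapped tables and the exception for 
other tables are assumptions based on the description above.

```scala
// Hypothetical stand-in for Spark's StructField; only the column name matters here.
final case class Field(name: String, dataType: String)

object PartitionSchemaSketch {

  /**
   * Resolves the partition schema against the table schema.
   *
   * - All partition columns found: return their fields (partitioned table).
   * - Some missing and the table is bootstrapped: return an empty schema,
   *   i.e. treat the table as non-partitioned (assumed behaviour).
   * - Some missing otherwise: fail, as the validation did before (assumed).
   */
  def resolvePartitionSchema(
      partitionColumns: Seq[String],
      tableSchema: Seq[Field],
      isBootstrapTable: Boolean): Seq[Field] = {
    val nameFieldMap = tableSchema.map(f => f.name -> f).toMap
    // Keep only the partition columns that actually exist in the schema.
    val partitionFields = partitionColumns.flatMap(nameFieldMap.get)

    if (partitionFields.size == partitionColumns.size) {
      partitionFields
    } else if (isBootstrapTable) {
      // No partition schema means no partition pruning for this table.
      Seq.empty
    } else {
      val missing = partitionColumns.filterNot(nameFieldMap.contains)
      throw new IllegalArgumentException(
        s"Cannot find columns: '${missing.mkString(",")}' " +
          s"in the schema[${tableSchema.mkString(",")}]")
    }
  }
}
```

   With a schema that lacks the partition column, 
`resolvePartitionSchema(Seq("dt"), schema, isBootstrapTable = true)` yields an 
empty sequence, while the same call with `isBootstrapTable = false` throws.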
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
