[GitHub] [hudi] umehrot2 commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

GitBox Fri, 22 Jul 2022 16:09:08 -0700


umehrot2 commented on code in PR #6163:
URL: https://github.com/apache/hudi/pull/6163#discussion_r928037390



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -96,10 +97,24 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
         val partitionFields = partitionColumns.get().map(column => 
StructField(column, StringType))
         StructType(partitionFields)
       } else {
-        val partitionFields = partitionColumns.get().map(column =>
-          nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column: '" +
-            s"$column' in the schema[${schema.fields.mkString(",")}]")))
-        StructType(partitionFields)
+        val partitionFields = partitionColumns.get().filter(column => 
nameFieldMap.contains(column))
+          .map(column => nameFieldMap.apply(column))
+
+        if (partitionFields.size != partitionColumns.get().size) {

Review Comment:
   I agree that in general this is just a temporary solution to not break 
bootstrap tables. This is tricky to handle. Because its not just about 
obtaining the partition schema from source, but also extracting the partition 
column values from the source path and writing them as correct data type in the 
target location. I remember having several discussions about it a year back.
   
   As per your question => 
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/ParquetBootstrapMetadataHandler.java#L61
 we are simply reading the source file footer right now to get the source 
schema. I think EMR team can take it up in the next release, but for now we 
should atleast prevent failures. @rahil-c can you create a jira to track this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] umehrot2 commented on a diff in pull request #6163: [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partit…

Reply via email to