rahil-c commented on code in PR #17904:
URL: https://github.com/apache/hudi/pull/17904#discussion_r2757198006
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkBasicSchemaEvolution.scala:
##########
@@ -20,32 +20,118 @@
package org.apache.spark.sql.execution.datasources.parquet
import org.apache.hudi.SparkAdapterSupport.sparkAdapter
-
+import org.apache.hudi.common.model.HoodieFileFormat
import org.apache.spark.sql.HoodieSchemaUtils
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
-import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.execution.datasources.SparkSchemaTransformUtils
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType}
/**
- * Intended to be used just with HoodieSparkParquetReader to avoid any java/scala issues
+ * Generic schema evolution handler for different file formats.
+ * Supports Parquet (default), and Lance currently.
Review Comment:
@the-other-tim-brown
Ack. In this response
https://github.com/apache/hudi/pull/17904#discussion_r2748679238 I asked for a
little more clarification on what you would like the refactor to look like,
since right now it's not clear to me where the null padding piece would be
placed. I am open to your suggestion; I am just trying to understand what it
would look like.
Assuming we do not do any null padding, for either top-level or nested
fields, in the following areas:
*
https://github.com/apache/hudi/pull/17904/changes#diff-56d3b110e2b04263ed60368227bddd9bef085799f4917701f936cbc9f7f71572R77
*
https://github.com/apache/hudi/pull/17904/changes#diff-56d3b110e2b04263ed60368227bddd9bef085799f4917701f936cbc9f7f71572R125
Then the `UnsafeProjection` we currently return from this function
https://github.com/apache/hudi/pull/17904/changes#diff-56d3b110e2b04263ed60368227bddd9bef085799f4917701f936cbc9f7f71572R81
would not be fully correct, as it would not align with the evolved schema.
So I am wondering whether your idea is that, before we apply this projection
to the iterator of unsafe rows, we either modify the existing projection or
create a new projection with the null padding, and whether that would happen
in `SparkBasicSchemaEvolution`
https://github.com/apache/hudi/pull/17904/changes#diff-8ed98fba80253c795ae16cb143f54eba4cc9616774c85ce8eb4ad9a83f422863R127
or in `SparkLanceReaderBase`
https://github.com/apache/hudi/pull/17904/changes#diff-bdccaaaeb061abdf550efec86661f9d3790c66d53e04b1ed2e9cf9a61ea06e13R135
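
To make the question concrete, here is a rough sketch of what "a projection with the null padding" could look like for top-level fields only. It is not the code in this PR: `NullPaddingSketch` and `nullPaddingProjection` are hypothetical names, it matches fields by name, it assumes that any field present in both schemas has the same type (no casts), and it does not handle nested fields.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{BoundReference, Expression, Literal, UnsafeProjection}
import org.apache.spark.sql.types._

// Hypothetical helper, for illustration only.
object NullPaddingSketch {

  /**
   * Builds a projection from rows laid out as `fileSchema` to rows laid out
   * as `requiredSchema`. Fields missing from the file become null literals.
   * Assumes matching fields have identical types in both schemas.
   */
  def nullPaddingProjection(fileSchema: StructType,
                            requiredSchema: StructType): UnsafeProjection = {
    // Map each field name in the file to its ordinal in the input row.
    val fileOrdinals: Map[String, Int] = fileSchema.fieldNames.zipWithIndex.toMap

    val exprs: Seq[Expression] = requiredSchema.fields.toSeq.map { field =>
      fileOrdinals.get(field.name) match {
        case Some(ordinal) =>
          // Field exists in the file: read it from its position in the input row.
          val fileField = fileSchema(ordinal)
          BoundReference(ordinal, fileField.dataType, fileField.nullable)
        case None =>
          // Field was added by schema evolution: pad with a typed null.
          Literal.create(null, field.dataType)
      }
    }
    UnsafeProjection.create(exprs)
  }
}
```

With something like this, the padding lives entirely in the expressions handed to `UnsafeProjection.create`, so the open question is only which class builds those expressions before the projection is applied to the row iterator.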