KnightChess commented on code in PR #12692:
URL: https://github.com/apache/hudi/pull/12692#discussion_r1927006574
##########
hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark35Analysis.scala:
##########
@@ -64,3 +72,132 @@ case class HoodieSpark35DataSourceV2ToV1Fallback(sparkSession: SparkSession) ext
LogicalRelation(relation, output, catalogTable, isStreaming = false)
}
}
+
+/**
+ * In Spark 3.5, the following resolution rules were removed:
+ * [[ResolveUserSpecifiedColumns]] and [[ResolveDefaultColumns]]
+ * (see code changes in [[org.apache.spark.sql.catalyst.analysis.Analyzer]]
+ * from https://github.com/apache/spark/pull/41262).
+ * The logic of resolving the user-specified columns and default values,
+ * which is required for INSERT with a subset of columns compared to the table
+ * schema to work properly, is deferred to [[PreprocessTableInsertion]] for v1 INSERT.
+ *
+ * Note that [[HoodieAnalysis]] intercepts the [[InsertIntoStatement]] after Spark's built-in
+ * resolution rules are applied, so the logic of resolving the user-specified columns and default
+ * values may no longer be applied. To make INSERT with a user-specified subset of columns
+ * work, this custom resolution rule [[HoodieSpark35ResolveColumnsForInsertInto]] is added
+ * to achieve the same, before converting [[InsertIntoStatement]] into
+ * [[InsertIntoHoodieTableCommand]].
+ *
+ * Also note that the projection logic in [[ResolveImplementationsEarly]] for INSERT is still
+ * needed in the case of INSERT with all columns in a different ordering.
+ */
+case class HoodieSpark35ResolveColumnsForInsertInto() extends ResolveInsertionBase {
+  // NOTE: This is copied from [[PreprocessTableInsertion]] with additional handling of Hudi relations
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    plan match {
+      case i @ InsertIntoStatement(table, _, _, query, _, _, _)
+        if table.resolved && query.resolved
+          && i.userSpecifiedCols.nonEmpty && i.table.isInstanceOf[LogicalRelation]
+          && sparkAdapter.isHoodieTable(i.table.asInstanceOf[LogicalRelation].catalogTable.get) =>
+        table match {
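
For context, the column resolution this rule restores can be sketched without Spark: given the table schema and the user-specified column list, each query output column is mapped to its table position, and unspecified columns are filled with their default values. The object and helper names below are illustrative, not part of Hudi or Spark, and the sketch ignores case sensitivity and type coercion that the real rule handles.

```scala
// Hypothetical standalone sketch of the user-specified-column resolution
// performed for v1 INSERT. Names and types here are illustrative only.
object InsertColumnResolutionSketch {
  // (column name, default value) pairs describing the target table schema
  type Schema = Seq[(String, Any)]

  def resolveUserSpecifiedColumns(
      schema: Schema,
      userCols: Seq[String],
      queryOutput: Seq[Any]): Seq[Any] = {
    require(userCols.length == queryOutput.length,
      "INSERT column list must match the query output arity")
    // Associate each user-specified column name with its query value
    val provided: Map[String, Any] = userCols.zip(queryOutput).toMap
    // Emit one value per table column, in table schema order,
    // falling back to the column default when the user omitted it
    schema.map { case (name, default) => provided.getOrElse(name, default) }
  }
}
```

For example, an `INSERT INTO t (c, a) VALUES (3, 1)` against a table with columns `a`, `b` (default `0`), `c` would resolve to the row `Seq(1, 0, 3)` in table schema order.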
Review Comment:
> Also are these relations v2? It seems that they are mixed with v1 logic.

Hudi is currently implemented mostly using the v1 approach, and v2 relations in the write path generally fall back to v1 relations, so I think it's OK for now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]