yihua commented on code in PR #12772: URL: https://github.com/apache/hudi/pull/12772#discussion_r2076179687
########## hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark40Analysis.scala: ########## @@ -0,0 +1,212 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hudi.analysis + +import org.apache.hudi.{DefaultSource, EmptyRelation, HoodieBaseRelation} +import org.apache.hudi.SparkAdapterSupport.sparkAdapter + +import org.apache.spark.sql.{AnalysisException, SparkSession, SQLContext} +import org.apache.spark.sql.catalyst.analysis.{ResolveInsertionBase, TableOutputResolver} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.errors.DataTypeErrors.toSQLId +import org.apache.spark.sql.errors.QueryCompilationErrors +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation, PreprocessTableInsertion} +import org.apache.spark.sql.execution.datasources.LogicalRelation +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.hudi.ProvidesHoodieConfig +import org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table +import 
org.apache.spark.sql.sources.InsertableRelation +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.util.PartitioningUtils.normalizePartitionSpec + +/** + * NOTE: PLEASE READ CAREFULLY + * + * Since Hudi relations don't currently implement DS V2 Read API, we have to fallback to V1 here. + * Such fallback will have considerable performance impact, therefore it's only performed in cases + * where V2 API have to be used. Currently only such use-case is using of Schema Evolution feature + * + * Check out HUDI-4178 for more details + */ +case class HoodieSpark40DataSourceV2ToV1Fallback(sparkSession: SparkSession) extends Rule[LogicalPlan] Review Comment: Similarly, I see `HoodieSpark40Analysis` is the same as `HoodieSpark35Analysis` except the class names. Could we extract these rules as a common rule class? OK to extract the existing logic in a separate PR so it's easier to reuse in this PR. ########## hudi-spark-datasource/hudi-spark4-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark4Analysis.scala: ########## @@ -0,0 +1,363 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hudi.analysis + +import org.apache.hudi.{DataSourceReadOptions, DefaultSource, SparkAdapterSupport} +import org.apache.hudi.storage.StoragePath + +import org.apache.spark.sql.{AnalysisException, SparkSession} +import org.apache.spark.sql.HoodieSpark4CatalystPlanUtils.MatchResolvedTable +import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, NamedRelation, ResolvedFieldName, UnresolvedAttribute, UnresolvedFieldName, UnresolvedPartitionSpec} +import org.apache.spark.sql.catalyst.analysis.SimpleAnalyzer.resolveExpressionByPlanChildren +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils} +import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.Origin +import org.apache.spark.sql.connector.catalog.{Table, V1Table} +import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.IdentifierHelper +import org.apache.spark.sql.execution.command.DescribeTableCommand +import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation} +import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.isMetaField +import org.apache.spark.sql.hudi.ProvidesHoodieConfig +import org.apache.spark.sql.hudi.analysis.HoodieSpark4Analysis.{HoodieV1OrV2Table, ResolvesToHudiTable} +import org.apache.spark.sql.hudi.catalog.HoodieInternalV2Table +import org.apache.spark.sql.hudi.command.{AlterHoodieTableDropPartitionCommand, ShowHoodieTablePartitionsCommand, TruncateHoodieTableCommand} +import org.apache.spark.sql.hudi.command.exception.HoodieAnalysisException +import org.apache.spark.sql.hudi.logical.TimeTravelRelation + +/** + * NOTE: PLEASE READ CAREFULLY + * + * Since Hudi relations don't currently implement DS V2 Read API, we have to fallback to V1 here. 
+ * Such fallback will have considerable performance impact, therefore it's only performed in cases + * where V2 API have to be used. Currently only such use-case is using of Schema Evolution feature + * + * Check out HUDI-4178 for more details + */ + +/** + * Rule for resolve hoodie's extended syntax or rewrite some logical plan. + */ +case class HoodieSpark4ResolveReferences(spark: SparkSession) extends Rule[LogicalPlan] Review Comment: `HoodieSpark4Analysis` is exactly the same as `HoodieSpark3Analysis` except for class names. Let's avoid the code duplication, relocate, rename and reuse the class. ########## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala: ########## @@ -114,7 +126,14 @@ object HoodieAnalysis extends SparkAdapterSupport { session => HoodiePostAnalysisRule(session) ) - if (HoodieSparkUtils.isSpark3) { + if (HoodieSparkUtils.isSpark4) { + val spark4PostHocResolutionClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark4PostAnalysisRule" Review Comment: Same here, `HoodieSpark4PostAnalysisRule` and `HoodieSpark3PostAnalysisRule` have the same logic. 
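The deduplication asked for in the review comments above follows a standard shape: move the shared rule logic into one common class, and keep the per-version classes as empty shims whose only job is to carry the version-specific name that the reflective lookup needs. A minimal sketch of that shape (plain Java, all names invented; the real classes would extend Spark's `Rule[LogicalPlan]` and take a `SparkSession`):

```java
// Sketch only: "CommonResolveReferences" and the shim names are hypothetical.
// The shared analysis logic lives once in the base class; the Spark 3 / Spark 4
// classes differ only by name, which is all the reflective lookup requires.
abstract class CommonResolveReferences {
    // Stand-in for the plan-rewrite logic that is currently duplicated
    // between HoodieSpark3Analysis and HoodieSpark4Analysis.
    String apply(String plan) {
        return "resolved:" + plan;
    }
}

// Thin per-version shims: no logic of their own.
class Spark3ResolveReferencesShim extends CommonResolveReferences {}
class Spark4ResolveReferencesShim extends CommonResolveReferences {}

public class DedupSketch {
    public static void main(String[] args) {
        CommonResolveReferences r3 = new Spark3ResolveReferencesShim();
        CommonResolveReferences r4 = new Spark4ResolveReferencesShim();
        // Both versions now run the identical shared logic.
        System.out.println(r3.apply("plan").equals(r4.apply("plan")));
    }
}
```

The same shape would apply to the other pairs flagged in this review (`HoodieSpark3PostAnalysisRule`/`HoodieSpark4PostAnalysisRule`, `ResolveHudiAlterTableCommand`): only the class name registered per Spark version needs to stay version-specific.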
########## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala: ########## @@ -114,7 +126,14 @@ object HoodieAnalysis extends SparkAdapterSupport { session => HoodiePostAnalysisRule(session) ) - if (HoodieSparkUtils.isSpark3) { + if (HoodieSparkUtils.isSpark4) { + val spark4PostHocResolutionClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark4PostAnalysisRule" + val spark4PostHocResolution: RuleBuilder = + session => instantiateKlass(spark4PostHocResolutionClass, session) + + rules += spark4PostHocResolution + } else { + // spark 3c Review Comment: ```suggestion // spark 3 ``` ########## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala: ########## @@ -66,24 +69,33 @@ object HoodieAnalysis extends SparkAdapterSupport { val dataSourceV2ToV1Fallback: RuleBuilder = session => instantiateKlass(dataSourceV2ToV1FallbackClass, session) - val spark3ResolveReferencesClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark3ResolveReferences" - val spark3ResolveReferences: RuleBuilder = - session => instantiateKlass(spark3ResolveReferencesClass, session) + val spark3PlusResolveReferencesClass = if (HoodieSparkUtils.isSpark4) + "org.apache.spark.sql.hudi.analysis.HoodieSpark4ResolveReferences" + else + "org.apache.spark.sql.hudi.analysis.HoodieSpark3ResolveReferences" + val spark3PlusResolveReferences: RuleBuilder = + session => instantiateKlass(spark3PlusResolveReferencesClass, session) // NOTE: PLEASE READ CAREFULLY BEFORE CHANGING // // It's critical for this rules to follow in this order; re-ordering this rules might lead to changes in // behavior of Spark's analysis phase (for ex, DataSource V2 to V1 fallback might not kick in before other rules, // leading to all relations resolving as V2 instead of current expectation of them being resolved as V1) - rules ++= Seq(dataSourceV2ToV1Fallback, spark3ResolveReferences) + rules ++= Seq(dataSourceV2ToV1Fallback, 
spark3PlusResolveReferences) - if (HoodieSparkUtils.gteqSpark3_5) { + if (HoodieSparkUtils.isSpark3_5) { rules += (_ => instantiateKlass( "org.apache.spark.sql.hudi.analysis.HoodieSpark35ResolveColumnsForInsertInto")) } + if (HoodieSparkUtils.isSpark4_0) { + rules += (_ => instantiateKlass( + "org.apache.spark.sql.hudi.analysis.HoodieSpark40ResolveColumnsForInsertInto")) + } val resolveAlterTableCommandsClass = - if (HoodieSparkUtils.gteqSpark3_5) { + if (HoodieSparkUtils.gteqSpark4_0) { + "org.apache.spark.sql.hudi.Spark40ResolveHudiAlterTableCommand" Review Comment: Similar here on `ResolveHudiAlterTableCommand` for reuse. Let's file JIRA tickets if we could not address them in this PR. ########## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala: ########## @@ -130,8 +149,9 @@ object HoodieAnalysis extends SparkAdapterSupport { // Default rules ) - val nestedSchemaPruningClass = - if (HoodieSparkUtils.gteqSpark3_5) { + val nestedSchemaPruningClass = if (HoodieSparkUtils.gteqSpark4_0) { + "org.apache.spark.sql.execution.datasources.Spark40NestedSchemaPruning" Review Comment: Same here on code reuse on `Spark40NestedSchemaPruning` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
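For context on why only the class *names* need to stay version-specific: `HoodieAnalysis` loads each rule class reflectively by name (`instantiateKlass`), so a Spark-4-only class is never classloaded on a Spark 3 classpath. A rough sketch of that mechanism in plain Java (`Session` and `DemoRule` are invented stand-ins for `SparkSession` and a version-specific rule class):

```java
import java.lang.reflect.Constructor;

// Invented stand-in for SparkSession.
class Session {}

// Invented stand-in for a rule class with a single-argument (session) constructor.
class DemoRule {
    final Session session;
    public DemoRule(Session session) { this.session = session; }
}

public class InstantiateKlassSketch {
    // Mirrors the instantiateKlass helper: resolve the class by name at runtime
    // and invoke its (session) constructor via reflection.
    static Object instantiateKlass(String klass, Session session) throws Exception {
        Constructor<?> ctor = Class.forName(klass).getConstructor(Session.class);
        return ctor.newInstance(session);
    }

    public static void main(String[] args) throws Exception {
        // In HoodieAnalysis the class name is chosen by a version check
        // (e.g. HoodieSparkUtils.isSpark4) before this call ever runs, so the
        // other version's class is never loaded at all.
        Object rule = instantiateKlass("DemoRule", new Session());
        System.out.println(rule.getClass().getSimpleName());
    }
}
```

This deferred loading is why extracting the shared logic into a common base (as suggested above) is safe: the version-gated name selection stays exactly where it is, and only the duplicated bodies collapse.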
