Re: [PR] [SPARK-50472][SQL] Introduce initial implementation of the single-pass Analyzer [spark]

via GitHub Fri, 20 Dec 2024 04:01:48 -0800


vladimirg-db commented on code in PR #49029:
URL: https://github.com/apache/spark/pull/49029#discussion_r1893855742



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/NameScope.scala:
##########
@@ -0,0 +1,393 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis.resolver
+
+import java.util.{ArrayDeque, ArrayList, HashSet}
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.analysis.{Resolver => NameComparator, UnresolvedStar}
+import org.apache.spark.sql.catalyst.expressions.{
+  Alias,
+  Attribute,
+  AttributeSeq,
+  Expression,
+  NamedExpression
+}
+import org.apache.spark.sql.errors.QueryCompilationErrors
+
+/**
+ * The [[NameScope]] is used during the analysis to control the visibility of names: plan names
+ * and output attributes. A new [[NameScope]] can be created both in the [[Resolver]] and in
+ * the [[ExpressionResolver]] using the [[NameScopeStack]] API. The name resolution for identifiers
+ * is case-insensitive.
+ *
+ * In this example:
+ *
+ * {{{
+ * WITH table_1_cte AS (
+ *   SELECT
+ *     col1,
+ *     col2,
+ *     col2
+ *   FROM
+ *     table_1
+ * )
+ * SELECT
+ *   table_1_cte.col1,
+ *   table_2.col1
+ * FROM
+ *   table_1_cte
+ * INNER JOIN
+ *   table_2
+ * ON
+ *   table_1_cte.col2 = table_2.col3
+ * ;
+ * }}}
+ *
+ * there are two named subplans in the scope: table_1_cte -> [col1, col2, col2] and
+ * table_2 -> [col1, col3].
+ *
+ * State breakdown:
+ * - `planOutputs`: list of named plan outputs. Order matters here (e.g. to correctly expand `*`).
+ *   Can contain duplicate names, since it's possible to select the same column twice, or to select
+ *   columns with the same name from different relations. [[OptionalIdentifierMap]] is used here,
+ *   since some plans don't have an explicit name, so output attributes from those plans will reside
+ *   under the `None` key.
+ *   In our example it will be {{{ [(table_1_cte, [col1, col2, col2]), (table_2, [col1, col3])] }}}
+ *
+ * - `planNameToOffset`: mapping from plan output names to their offsets in the `planOutputs` array.
+ *   It's used to look up attributes by plan output names (multipart names are not supported yet).
+ *   In our example it will be {{{ [table_1_cte -> 0, table_2 -> 1] }}}
+ */
+class NameScope extends SQLConfHelper {
+  private val planOutputs = new ArrayList[PlanOutput]()
+  private val planNameToOffset = new OptionalIdentifierMap[Int]
+  private val nameComparator: NameComparator = conf.resolver
+  private val existingAliases = new HashSet[String]
+
+  /**
+   * Register the named plan output in this [[NameScope]]. The named plan is usually a
+   * [[NamedRelation]]. The `attributes` sequence can contain duplicate names both for this named
+   * plan and for the scope in general, despite the fact that their further resolution _may_ throw
+   * an error in case of an ambiguous reference. After calling this method, the code can look up
+   * the attributes using `get*` methods of this [[NameScope]].
+   *
+   * Duplicate plan names are merged into the same [[PlanOutput]]. For example, this query:
+   *
+   * {{{ SELECT t.* FROM (SELECT * FROM VALUES (1)) as t, (SELECT * FROM VALUES (2)) as t; }}}
+   *
+   * will have the following output schema:
+   *
+   * {{{ [col1, col1] }}}
+   *
+   * Same logic applies for the unnamed plan outputs. This query:
+   *
+   * {{{ SELECT * FROM (SELECT * FROM VALUES (1)), (SELECT * FROM VALUES (2)); }}}
+   *
+   * will have the same output schema:
+   *
+   * {{{ [col1, col1] }}}
+   *
+   * @param name The name of this named plan.
+   * @param attributes The output of this named plan. Can contain duplicate names.
+   */
+  def update(name: String, attributes: Seq[Attribute]): Unit = {
+    update(attributes, Some(name))
+  }
+
+  /**
+   * Register the unnamed plan output in this [[NameScope]]. Some examples of unnamed plans are
+   * [[Project]] and [[Aggregate]].
+   *
+   * See the [[update]] method for more details.
+   *
+   * @param attributes The output of the unnamed plan. Can contain duplicate names.
+   */
+  def +=(attributes: Seq[Attribute]): Unit = {
+    update(attributes)
+  }
+
+  /**
+   * Get all the attributes from all the plans registered in this [[NameScope]]. The output can
+   * contain duplicate names. This is used for star (`*`) resolution.
+   */
+  def getAllAttributes: Seq[Attribute] = {
+    val attributes = new mutable.ArrayBuffer[Attribute]
+
+    planOutputs.forEach(planOutput => {
+      attributes.appendAll(planOutput.attributes)
+    })
+
+    attributes.toSeq
+  }
+
+  /**
+   * Expand the [[UnresolvedStar]] using `planOutputs`. The expected use case for this method is
+   * star expansion inside [[Project]]. Since [[Project]] has only one child, we assert that the
+   * size of `planOutputs` is 1, otherwise the query is malformed.
+   *
+   * Some examples of queries with a star:
+   *
+   *  - Star without a target:
+   *  {{{ SELECT * FROM VALUES (1,  2,  3) AS t(a, b, c); }}}
+   *  - Star with a multipart name target:
+   *  {{{ SELECT catalog1.database1.table1.* FROM catalog1.database1.table1; }}}
+   *  - Star with a struct target:
+   *  {{{ SELECT d.* FROM VALUES (named_struct('a', 1, 'b', 2)) AS t(d); }}}
+   *  - Star as an argument to a function:
+   *  {{{ SELECT concat_ws('', *) AS result FROM VALUES (1, 2, 3) AS t(a, b, c); }}}
+   *
+   * It is resolved by correctly resolving the star qualifier.
+   * Please check [[UnresolvedStarBase.expandStar]] for more details.
+   *
+   * @param unresolvedStar [[UnresolvedStar]] to expand.
+   * @return The output of a plan expanded from the star.
+   */
+  def expandStar(unresolvedStar: UnresolvedStar): Seq[NamedExpression] = {
+    if (planOutputs.size != 1) {
+      throw QueryCompilationErrors.invalidStarUsageError("query", Seq(unresolvedStar))
+    }
+
+    planOutputs.get(0).expandStar(unresolvedStar)
+  }
+
+  /**
+   * Get all matched attributes by a multipart name. It returns [[Attribute]]s when we resolve a
+   * simple column or an alias name from a lower operator. However, this function can also return
+   * [[Alias]]es in case we access a struct field or a map value using some key.
+   *
+   * Example that contains those major use-cases:
+   *
+   * {{{
+   *  SELECT col1, a, col2.field, col3.struct.field, col4.key
+   *  FROM (SELECT *, col5 AS a FROM t);
+   * }}}
+   *
+   * has a Project list that looks like this:
+   *
+   * {{{
+   *   AttributeReference(col1),
+   *   AttributeReference(a),
+   *   Alias(col2.field, field),
+   *   Alias(col3.struct.field, field),
+   *   Alias(col4[CAST(key AS INT)], key)
+   * }}}
+   *
+   * Also, see [[AttributeSeq.resolve]] for more details.
+   *
+   * Since there can be several identical attribute names for several named plans, this function
+   * can return multiple values:
+   * - 0 values: No matched attributes
+   * - 1 value: Unique attribute matched
+   * - 2+ values: Ambiguity, several attributes matched
+   *
+   * One example of a query with an attribute that has a multipart name:
+   *
+   * {{{ SELECT catalog1.database1.table1.col1 FROM catalog1.database1.table1; }}}
+   *
+   * @param multipartName Multipart attribute name. Can be of several forms:
+   *   - `catalog.database.table.column`
+   *   - `database.table.column`
+   *   - `table.column`
+   *   - `column`
+   * @return All the attributes matched by the `multipartName`, encapsulated in a [[NameTarget]].
+   */
+  def matchMultipartName(multipartName: Seq[String]): NameTarget = {
+    val candidates = new mutable.ArrayBuffer[Expression]
+    val allAttributes = new mutable.ArrayBuffer[Attribute]
+    var aliasName: Option[String] = None
+
+    planOutputs.forEach(planOutput => {
+      allAttributes.appendAll(planOutput.attributes)
+      val nameTarget = planOutput.matchMultipartName(multipartName)
+      if (nameTarget.aliasName.isDefined) {
+        aliasName = nameTarget.aliasName

Review Comment:
   Yeah, this code is not great. I'm planning to do a simplification PR for the `NameScope` once the UNIONs are implemented.
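
For readers following the thread, here is a minimal standalone sketch of the lookup semantics that the quoted Scaladoc describes (ordered plan outputs, a name-to-offset map, duplicate plan names merged, and 0 / 1 / 2+ matches). It is not the PR's code and not the planned simplification: `Attr`, `SimpleNameScope` and `matchName` are hypothetical names, Spark's `Attribute`, `PlanOutput`, `OptionalIdentifierMap` and `NameTarget` are replaced by plain Scala collections, and case-insensitive matching is assumed as stated in the Scaladoc.

```scala
import java.util.Locale

import scala.collection.mutable

// Hypothetical stand-in for Spark's Attribute; only the name matters here.
final case class Attr(name: String)

// Simplified model of the scope state from the Scaladoc (an assumption, not Spark's NameScope).
final class SimpleNameScope {
  // Plan outputs in registration order; order matters for `*` expansion.
  private val planOutputs = mutable.ArrayBuffer.empty[mutable.ArrayBuffer[Attr]]
  // Lower-cased plan name (None for unnamed plans) -> offset into `planOutputs`.
  private val planNameToOffset = mutable.Map.empty[Option[String], Int]

  // Register a plan output; outputs registered under the same name are merged.
  def update(attributes: Seq[Attr], name: Option[String] = None): Unit = {
    val key = name.map(_.toLowerCase(Locale.ROOT))
    planNameToOffset.get(key) match {
      case Some(offset) => planOutputs(offset) ++= attributes
      case None =>
        planNameToOffset(key) = planOutputs.size
        planOutputs += mutable.ArrayBuffer(attributes: _*)
    }
  }

  // All attributes of all registered plans, duplicates included.
  def getAllAttributes: Seq[Attr] = planOutputs.flatten.toSeq

  // Returns 0 matches (not found), 1 match (unique) or 2+ matches (ambiguous).
  // Only `column` and `plan.column` are handled in this sketch.
  def matchName(multipartName: Seq[String]): Seq[Attr] = multipartName match {
    case Seq(column) =>
      getAllAttributes.filter(_.name.equalsIgnoreCase(column))
    case Seq(plan, column) =>
      planNameToOffset.get(Some(plan.toLowerCase(Locale.ROOT))).toSeq
        .flatMap(offset => planOutputs(offset))
        .filter(_.name.equalsIgnoreCase(column))
    case _ => Seq.empty
  }
}

object NameScopeSketch {
  def main(args: Array[String]): Unit = {
    val scope = new SimpleNameScope
    // Same registrations as the CTE/join example in the Scaladoc.
    scope.update(Seq(Attr("col1"), Attr("col2"), Attr("col2")), Some("table_1_cte"))
    scope.update(Seq(Attr("col1"), Attr("col3")), Some("table_2"))

    println(scope.matchName(Seq("col3")).size)                // 1: unique
    println(scope.matchName(Seq("col1")).size)                // 2: ambiguous
    println(scope.matchName(Seq("TABLE_2", "col1")).size)     // 1: qualified, case-insensitive
    println(scope.matchName(Seq("table_1_cte", "col2")).size) // 2: duplicate column in one plan
    println(scope.matchName(Seq("col4")).size)                // 0: no match
  }
}
```

Returning every candidate and leaving the ambiguity decision to the caller mirrors the 0 / 1 / 2+ contract in the Scaladoc; the real method additionally tracks alias names and nested field access, which this sketch omits.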


