[GitHub] [spark] cloud-fan commented on a diff in pull request #41368: [SPARK-43867][SQL] Improve suggested candidates for unresolved attribute

via GitHub Wed, 31 May 2023 08:10:00 -0700


cloud-fan commented on code in PR #41368:
URL: https://github.com/apache/spark/pull/41368#discussion_r1211875623



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala:
##########
@@ -82,36 +82,26 @@ object StringUtils extends Logging {
 
   private[spark] def orderSuggestedIdentifiersBySimilarity(
       baseString: String,
-      testStrings: Seq[String]): Seq[String] = {
-    // This method is used to generate suggested list of candidates closest to `baseString` from the
-    // list of `testStrings`. Spark uses it to clarify error message in case a query refers to non
-    // existent column or attribute. The `baseString` could be single part or multi part and this
-    // method will try to match suggestions.
-    // Note that identifiers from `testStrings` could represent columns or attributes from different
-    // catalogs, schemas or tables. We preserve suggested identifier prefix and reconstruct
-    // multi-part identifier after ordering if there are more than one unique prefix in a list. This
-    // will also reconstruct multi-part identifier for the cases of nested columns. E.g. for a
-    // table `t` with columns `a`, `b`, `c.d` (nested) and requested column `d` we will create
-    // prefixes `t`, `t`, and `t.c`. Since there is more than one distinct prefix we will return
-    // sorted suggestions as multi-part identifiers => (`t`.`c`.`d`, `t`.`a`, `t`.`b`).
-    val multiPart = UnresolvedAttribute.parseAttributeName(baseString).size > 1
-    if (multiPart) {
-      testStrings.sortBy(LevenshteinDistance.getDefaultInstance.apply(_, baseString))
-    } else {
-      val split = testStrings.map { ident =>
-        val parts = UnresolvedAttribute.parseAttributeName(ident).map(quoteIfNeeded)
-        (parts.init.mkString("."), parts.last)
-      }
-      val sorted =
-        split.sortBy(pair => LevenshteinDistance.getDefaultInstance.apply(pair._2, baseString))
-      if (sorted.map(_._1).toSet.size == 1) {
-        // All identifier belong to the same relation
-        sorted.map(_._2)
+      candidates: Seq[Seq[String]]): Seq[String] = {
+    val baseParts = UnresolvedAttribute.parseAttributeName(baseString)
+    val strippedCandidates =
+      // Group by the qualifier. If all identifiers have the same qualifier, strip it.
+      // For example: Seq(`abc`.`def`.`t1`, `abc`.`def`.`t2`) => Seq(`t1`, `t2`)
+      if (baseParts.size == 1 && candidates.groupBy(_.dropRight(1)).size == 1) {
+        candidates.map(_.takeRight(1))
+      // Group by the qualifier excluding table name. If all identifiers have the same prefix
+      // (namespace) excluding table names, strip this prefix.
+      // For example: Seq(`abc`.`def`.`t1`, `abc`.`xyz`.`t2`) => Seq(`def`.`t1`, `xyz`.`t2`)
+      } else if (baseParts.size <= 2 && candidates.groupBy(_.dropRight(2)).size == 1) {
+        candidates.map(_.takeRight(2))
       } else {
-        // More than one relation
-        sorted.map(x => if (x._1.isEmpty) s"${x._2}" else s"${x._1}.${x._2}")
+        // Some candidates have different qualifiers
+        candidates
       }
-    }
+
+    strippedCandidates
+      .map(quoteNameParts)
+      .sortBy(LevenshteinDistance.getDefaultInstance.apply(_, baseString))

Review Comment:
   One follow-up we can do: instead of sorting by the quoted qualified name, we can sort by the `Seq[String]` directly. The algorithm should sort by the last name part first, then by the second-to-last name part, and so on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
