stevomitric commented on code in PR #48546:
URL: https://github.com/apache/spark/pull/48546#discussion_r1818605937


##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -2039,4 +2039,35 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     checkAnswer(sql("SELECT NAME FROM collations() WHERE ICU_VERSION is null"),
       Seq(Row("UTF8_BINARY"), Row("UTF8_LCASE")))
   }
+
+  test("fully qualified name") {
+    // Make sure that the collation expression returns the correct fully qualified name.
+    Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+      val df = sql(s"SELECT collation('a' collate $collation)")
+      checkAnswer(df,
+        Seq(Row(s"${CollationFactory.CATALOG}.${CollationFactory.SCHEMA}.$collation")))
+    }
+
+    // Make sure the user can specify the fully qualified name as a collation name.
+    Seq[String]("contains", "startswith", "endswith").foreach{ binaryFunction =>
+      Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+        val dfRegularName = sql(
+          s"SELECT $binaryFunction('a' collate $collation, 'A' collate $collation)")
+        val dfFullyQualifiedName = sql(
+          s"SELECT $binaryFunction('a' collate SYSTEM.BUILTIN.$collation, 'A' collate $collation)")
+        checkAnswer(dfRegularName, dfFullyQualifiedName.collect())
+      }
+    }
+
+    // Wrong collation names raise a Spark runtime exception.
+    intercept[SparkException](sql("SELECT 'a' COLLATE SYSTEM.BUILTIN2.UTF8_BINARY"))

Review Comment:
   Done.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -2039,4 +2039,35 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     checkAnswer(sql("SELECT NAME FROM collations() WHERE ICU_VERSION is null"),
       Seq(Row("UTF8_BINARY"), Row("UTF8_LCASE")))
   }
+
+  test("fully qualified name") {
+    // Make sure that the collation expression returns the correct fully qualified name.
+    Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+      val df = sql(s"SELECT collation('a' collate $collation)")
+      checkAnswer(df,
+        Seq(Row(s"${CollationFactory.CATALOG}.${CollationFactory.SCHEMA}.$collation")))
+    }
+
+    // Make sure the user can specify the fully qualified name as a collation name.
+    Seq[String]("contains", "startswith", "endswith").foreach{ binaryFunction =>
+      Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+        val dfRegularName = sql(
+          s"SELECT $binaryFunction('a' collate $collation, 'A' collate $collation)")
+        val dfFullyQualifiedName = sql(
+          s"SELECT $binaryFunction('a' collate SYSTEM.BUILTIN.$collation, 'A' collate $collation)")
+        checkAnswer(dfRegularName, dfFullyQualifiedName.collect())
+      }
+    }
+
+    // Wrong collation names raise a Spark runtime exception.

Review Comment:
   Because it was already being used for this type of message. See [this line](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala#L193). I modified the comment a bit.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -2015,4 +2015,35 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     checkAnswer(sql("SELECT NAME FROM collations() WHERE ICU_VERSION is null"),
       Seq(Row("UTF8_BINARY"), Row("UTF8_LCASE")))
   }
+
+  test("fully qualified name") {
+    // Make sure that the collation expression returns the correct fully qualified name.
+    Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>

Review Comment:
   Merged two tests that iterate over the collation identifiers.



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -2015,4 +2015,35 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     checkAnswer(sql("SELECT NAME FROM collations() WHERE ICU_VERSION is null"),
       Seq(Row("UTF8_BINARY"), Row("UTF8_LCASE")))
   }
+
+  test("fully qualified name") {
+    // Make sure that the collation expression returns the correct fully qualified name.
+    Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+      val df = sql(s"SELECT collation('a' collate $collation)")
+      checkAnswer(df,
+        Seq(Row(s"${CollationFactory.CATALOG}.${CollationFactory.SCHEMA}.$collation")))
+    }
+
+    // Make sure the user can specify the fully qualified name as a collation name.
+    Seq[String]("contains", "startswith", "endswith").foreach{ binaryFunction =>
+      Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+        val dfRegularName = sql(
+          s"SELECT $binaryFunction('a' collate $collation, 'A' collate $collation)")
+        val dfFullyQualifiedName = sql(
+          s"SELECT $binaryFunction('a' collate SYSTEM.BUILTIN.$collation, 'A' collate $collation)")

Review Comment:
   I changed the value in the test; however, these names should be completely case-insensitive, so it shouldn't matter.
   The initial PR that introduced these values in `CollationFactory.java` ([here](https://github.com/apache/spark/pull/47364)) has them uppercased.
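   As a quick illustration (a minimal sketch, not part of the PR; it assumes `CollationFactory.collationNameToId` already accepts the fully qualified form added here), mixed-case spellings of the qualifier and the collation name should resolve to the same collation ID:
   ```scala
   import org.apache.spark.sql.catalyst.util.CollationFactory

   // All spellings below should map to one and the same collation ID, since both the
   // SYSTEM.BUILTIN qualifier and the collation name are matched case-insensitively.
   val ids = Seq(
     "SYSTEM.BUILTIN.UTF8_LCASE",
     "system.builtin.utf8_lcase",
     "System.Builtin.Utf8_Lcase"
   ).map(CollationFactory.collationNameToId)
   assert(ids.distinct.size == 1)
   ```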



##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala:
##########
@@ -2015,4 +2015,35 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     checkAnswer(sql("SELECT NAME FROM collations() WHERE ICU_VERSION is null"),
       Seq(Row("UTF8_BINARY"), Row("UTF8_LCASE")))
   }
+
+  test("fully qualified name") {
+    // Make sure that the collation expression returns the correct fully qualified name.
+    Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+      val df = sql(s"SELECT collation('a' collate $collation)")
+      checkAnswer(df,
+        Seq(Row(s"${CollationFactory.CATALOG}.${CollationFactory.SCHEMA}.$collation")))
+    }
+
+    // Make sure the user can specify the fully qualified name as a collation name.
+    Seq[String]("contains", "startswith", "endswith").foreach{ binaryFunction =>
+      Seq[String]("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI_AI").foreach { collation =>
+        val dfRegularName = sql(
+          s"SELECT $binaryFunction('a' collate $collation, 'A' collate $collation)")
+        val dfFullyQualifiedName = sql(
+          s"SELECT $binaryFunction('a' collate SYSTEM.BUILTIN.$collation, 'A' collate $collation)")
+        checkAnswer(dfRegularName, dfFullyQualifiedName.collect())
+      }
+    }

Review Comment:
   Is this a valid thing to do? It throws an `INTERNAL_ERROR`.



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -1154,9 +1154,33 @@ public static StringSearch getStringSearch(
    * Returns the collation ID for the given collation name.
    */
  public static int collationNameToId(String collationName) throws SparkException {
+    // If collation name is given as a fully qualified name, extract the actual collation name as
+    // the last part of the [catalog].[schema].[collation_name] name.
+    long numDots = collationName.chars().filter(ch -> ch == '.').count();
+    if (numDots > 0) {
+      String[] nameParts = collationName.split("\\.");
+      // Currently only predefined collations are supported.
+      if (numDots != 2 || !CollationFactory.CATALOG.equalsIgnoreCase(nameParts[0]) ||
+          !CollationFactory.SCHEMA.equalsIgnoreCase(nameParts[1])) {
+        throw CollationFactory.Collation.CollationSpec.collationInvalidNameException(collationName);

Review Comment:
   > then pass only the last part to our suggestions algorithm
   
   I'm not sure this is the right approach here, since the error message would be confusing:
   ```sql
   SELECT 'a' COLLATE SYSTEM.BUILTN.UTF8_LCASE
   -- [COLLATION_INVALID_NAME] The value UTF8_LCASE does not represent a correct collation name. Suggested valid collation names: [UTF8_LCASE]. SQLSTATE: 42704
   ```
   
   This happens only when the user specifies the wrong `catalog`.`schema`, not the collation itself.
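   If we did want to go that way, one possible shape (an illustrative sketch only, not the PR's code; it assumes the `collationInvalidNameException` helper quoted in the hunk above is reachable from the caller) would be to resolve only the trailing part, but raise the error with the user's full input whenever the qualifier itself is wrong:
   ```scala
   import org.apache.spark.sql.catalyst.util.CollationFactory

   def resolveQualifiedCollation(input: String): Int = {
     val parts = input.split("\\.")
     if (parts.length == 1) {
       CollationFactory.collationNameToId(input)
     } else if (parts.length == 3 &&
         CollationFactory.CATALOG.equalsIgnoreCase(parts(0)) &&
         CollationFactory.SCHEMA.equalsIgnoreCase(parts(1))) {
       // Qualifier is fine: suggestions are computed from the collation part only.
       CollationFactory.collationNameToId(parts(2))
     } else {
       // Qualifier is wrong: the error should carry the full input, not just parts(2).
       throw CollationFactory.Collation.CollationSpec.collationInvalidNameException(input)
     }
   }
   ```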



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -1154,9 +1154,33 @@ public static StringSearch getStringSearch(
    * Returns the collation ID for the given collation name.
    */
  public static int collationNameToId(String collationName) throws SparkException {
+    // If collation name is given as a fully qualified name, extract the actual collation name as
+    // the last part of the [catalog].[schema].[collation_name] name.
+    long numDots = collationName.chars().filter(ch -> ch == '.').count();
+    if (numDots > 0) {
+      String[] nameParts = collationName.split("\\.");

Review Comment:
   Is there any function I can access from `spark/common`?



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -1154,9 +1154,33 @@ public static StringSearch getStringSearch(
    * Returns the collation ID for the given collation name.
    */
  public static int collationNameToId(String collationName) throws SparkException {
+    // If collation name is given as a fully qualified name, extract the actual collation name as
+    // the last part of the [catalog].[schema].[collation_name] name.
+    long numDots = collationName.chars().filter(ch -> ch == '.').count();

Review Comment:
   `AttributeNameParser.parseAttributeName` would introduce a circular dependency. Do we have something similar in `spark/common`? Can I move this trait, @HyukjinKwon?
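   For reference, a dependency-free splitter along those lines is small enough that something like it could live in `spark/common` if nothing suitable exists there. A rough sketch (modelled on what `parseAttributeName` does; the exact behaviour is an assumption, not an existing API):
   ```scala
   // Splits a multi-part name on dots while honouring backtick-quoted parts, so a
   // quoted part such as `my.catalog` is not broken apart by the dot it contains.
   def splitQualifiedName(name: String): Seq[String] = {
     val parts = scala.collection.mutable.ArrayBuffer.empty[String]
     val current = new StringBuilder
     var inQuotes = false
     var i = 0
     while (i < name.length) {
       name.charAt(i) match {
         case '`' if inQuotes && i + 1 < name.length && name.charAt(i + 1) == '`' =>
           current.append('`'); i += 1  // escaped backtick inside a quoted part
         case '`' => inQuotes = !inQuotes
         case '.' if !inQuotes => parts += current.toString; current.clear()
         case c => current.append(c)
       }
       i += 1
     }
     parts += current.toString
     parts.toSeq
   }
   ```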


