Re: [PR] [SPARK-50206][SQL] Added separate collation id for UTF8_BINARY and non-collated strings [spark]

via GitHub Tue, 05 Nov 2024 02:06:56 -0800


stefankandic commented on code in PR #48737:
URL: https://github.com/apache/spark/pull/48737#discussion_r1829047522



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -231,9 +231,10 @@ public Collation(
      * UTF8_BINARY collation ID binary layout:
      * bit 31-24: Zeroes.
      * bit 23-22: Zeroes, reserved for version.
-     * bit 21-19 Zeros, reserved for future trimmings.
-     * bit 18 0 = none, 1 = right trim.
-     * bit 17-3:  Zeroes.
+     * bit 21-19: Zeros, reserved for future trimmings.
+     * bit 18:    0 = none, 1 = right trim.
+     * bit 17:    0 = none, 1 = utf8_binary.
+     * bit 16-3:  Zeroes.
      * bit 2:     0, reserved for accent sensitivity.
      * bit 1:     0, reserved for uppercase and case-insensitive.
      * bit 0:     0 = case-sensitive, 1 = lowercase.

Review Comment:
   Do you think it would maybe make sense to have this new specifier be at the 
bit 0 and move the rest one place up? Then we would have this nice property of 
implicit utf8 being 0, explicit 1 and lcase having id=2



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -1123,6 +1181,8 @@ public CollationIdentifier identifier() {
   public static final List<String> SUPPORTED_PROVIDERS = 
List.of(PROVIDER_SPARK, PROVIDER_ICU);
   public static final String COLLATION_PAD_ATTRIBUTE = "NO_PAD";
 
+  public static final int DEFAULT_COLLATION_ID =

Review Comment:
   I feel like this being public can introduce bugs in the future because 
someone just checks the id and doesn't call the function; can we make it 
private or remove it altogether?



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -547,23 +587,41 @@ private static CollationSpecUTF8 fromCollationId(int 
collationId) {
         // Extract case sensitivity from collation ID.
         int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
           CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        // Extract utf8 binary collation type from collation ID.
+        int utf8BinaryCollationType = SpecifierUtils.getSpecValue(collationId,
+          UTF8BINARY_COLLATION_TYPE_OFFSET, UTF8BINARY_COLLATION_TYPE_MASK);
         // Extract space trimming from collation ID.
         int spaceTrimmingOrdinal = getSpaceTrimming(collationId).ordinal();
         assert(isValidCollationId(collationId));
         return new CollationSpecUTF8(
           CaseSensitivity.values()[caseConversionOrdinal],
+          Utf8BinaryCollationType.values()[utf8BinaryCollationType],
           SpaceTrimming.values()[spaceTrimmingOrdinal]);
       }
 
       private static boolean isValidCollationId(int collationId) {
-        collationId = SpecifierUtils.removeSpec(
-          collationId,
-          SPACE_TRIMMING_OFFSET,
-          SPACE_TRIMMING_MASK);
-        collationId = SpecifierUtils.removeSpec(
-          collationId,
-          CASE_SENSITIVITY_OFFSET,
-          CASE_SENSITIVITY_MASK);
+        if (SpecifierUtils.getSpecValue(collationId, 
UTF8BINARY_COLLATION_TYPE_OFFSET,
+          UTF8BINARY_COLLATION_TYPE_MASK) != 0) {

Review Comment:
   Why do we need to have this if statement here now? 



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -231,9 +231,10 @@ public Collation(
      * UTF8_BINARY collation ID binary layout:
      * bit 31-24: Zeroes.
      * bit 23-22: Zeroes, reserved for version.
-     * bit 21-19 Zeros, reserved for future trimmings.
-     * bit 18 0 = none, 1 = right trim.
-     * bit 17-3:  Zeroes.
+     * bit 21-19: Zeros, reserved for future trimmings.
+     * bit 18:    0 = none, 1 = right trim.
+     * bit 17:    0 = none, 1 = utf8_binary.
+     * bit 16-3:  Zeroes.
      * bit 2:     0, reserved for accent sensitivity.
      * bit 1:     0, reserved for uppercase and case-insensitive.
      * bit 0:     0 = case-sensitive, 1 = lowercase.

Review Comment:
   then we don't have to change this:
   `case object StringType extends StringType(0)`



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -394,7 +395,11 @@ private static Collation fetchCollation(int collationId) {
         // instance.
         assert (collationId >= 0 && getDefinitionOrigin(collationId)
           == DefinitionOrigin.PREDEFINED);
-        if (collationId == UTF8_BINARY_COLLATION_ID) {

Review Comment:
   There should be no such thing as non-collated string, each string is 
collated according to some rules/logic. Naming it like this would imply that 
utf8 binary is not the default collation when it fact it is. WDYT @srielau 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-50206][SQL] Added separate collation id for UTF8_BINARY and non-collated strings [spark]

Reply via email to