Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

via GitHub Wed, 22 May 2024 06:05:22 -0700


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1609915373



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -173,26 +174,546 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation ID is defined as 32-bit integer. We specify binary layouts for different classes of
+     * collations. Classes of collations are differentiated by most significant 3 bits (bit 31, 30
+     * and 29), bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * General collation ID binary layout:
+     * bit 31:    1 for INDETERMINATE (requires all other bits to be 1 as well), 0 otherwise.
+     * bit 30:    0 for predefined, 1 for user-defined.
+     * Following bits are specified for predefined collations:
+     * bit 29:    0 for UTF8_BINARY, 1 for ICU collations.
+     * bit 28-24: Reserved.
+     * bit 23-22: Reserved for version.
+     * bit 21-18: Reserved for space trimming.
+     * bit 17-0:  Depend on collation family.
+     * ---
+     * INDETERMINATE collation ID binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation ID is equal to -1.
+     * ---
+     * User-defined collation ID binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: Undefined, reserved for future use.
+     * ---
+     * UTF8_BINARY collation ID binary layout:
+     * bit 31-24: Zeroes.
+     * bit 23-22: Zeroes, reserved for version.
+     * bit 21-18: Zeroes, reserved for space trimming.
+     * bit 17-3:  Zeroes.
+     * bit 2:     0, reserved for accent sensitivity.
+     * bit 1:     0, reserved for uppercase and case-insensitive.
+     * bit 0:     0 = case-sensitive, 1 = lowercase.
+     * ---
+     * ICU collation ID binary layout:

Review Comment:
   We did design the binary layout. There are several reasons for putting the
   accent and case sensitivity bits in different positions for `UTF8_BINARY`
   and ICU collations:
   - Case sensitivity doesn't mean the same thing in `UTF8_BINARY` and ICU
     collations. For ICU we only support `CS`/`CI`, while for `UTF8_BINARY` we
     only have unspecified/`LCASE` for now, with `UCASE` and `CI` reserved for
     future use.
   - To preserve the collation IDs we were using before these changes for
     `UTF8_BINARY` (0) and `UTF8_BINARY_LCASE` (1), we needed to set aside the
     least significant bit for case sensitivity. For ICU collations it makes
     more sense to use the least significant bits for the locale ID.
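
To make the layout concrete, here is a minimal sketch of how the class of a collation ID could be read from the top three bits, using only the bit positions given in the Javadoc quoted above. The class `CollationIdSketch`, the method `classify`, and the returned labels are hypothetical names for illustration and are not taken from `CollationFactory`. The sketch also assumes that an ID with bit 31 set but not all bits set is invalid, which follows from the "requires all other bits to be 1" note.

```java
final class CollationIdSketch {
    // Bit positions follow the Javadoc in the diff above.
    private static final int INDETERMINATE_ID = -1;        // bits 31-0 all set
    private static final int USER_DEFINED_MASK = 1 << 30;  // bit 30
    private static final int ICU_MASK = 1 << 29;           // bit 29
    private static final int LCASE_MASK = 1;               // bit 0 (UTF8_BINARY only)

    private CollationIdSketch() {}

    /** Hypothetical helper: returns a coarse label for the collation class. */
    static String classify(int collationId) {
        if (collationId == INDETERMINATE_ID) {
            return "INDETERMINATE";
        }
        if (collationId < 0) {
            // Assumption: bit 31 is only valid when every other bit is set too.
            throw new IllegalArgumentException("Invalid collation ID: " + collationId);
        }
        if ((collationId & USER_DEFINED_MASK) != 0) {
            return "USER_DEFINED";
        }
        if ((collationId & ICU_MASK) != 0) {
            return "ICU";
        }
        // Predefined, non-ICU: bit 0 distinguishes the two existing IDs (0 and 1).
        return (collationId & LCASE_MASK) != 0 ? "UTF8_BINARY_LCASE" : "UTF8_BINARY";
    }
}
```

With this reading, the pre-existing IDs 0 and 1 still map to `UTF8_BINARY` and `UTF8_BINARY_LCASE`, which is the compatibility constraint described in the second bullet.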



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


