nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1606959835


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {

Review Comment:
   I added several clarifications throughout `CollationFactory` and am
resolving this thread. If anything still needs clarification, let's discuss
the specific problematic points in the code.
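
For readers following the bit layout in the Javadoc above, here is a minimal sketch of how an ICU collation id could be composed and how an id could be classified by its top three bits. It is only an illustration of the documented layout, not the code in this PR: the class `CollationIdSketch` and all of its method and constant names are hypothetical, and the locale ids used (0 for UNICODE, 1 for `af`) are inferred from the examples in the comment.

```java
// Hypothetical illustration of the documented id layout; not the PR's CollationSpec code.
final class CollationIdSketch {
  // Bit positions copied from the layout described in the Javadoc.
  static final int INDETERMINATE_ID = -1;         // all 32 bits set
  static final int USER_DEFINED_FLAG = 1 << 30;   // bit 30 (with bit 31 = 0)
  static final int ICU_FLAG = 1 << 29;            // bit 29 marks an ICU collation
  static final int CASE_INSENSITIVE = 1 << 17;    // bit 17
  static final int ACCENT_INSENSITIVE = 1 << 16;  // bit 16
  static final int LOCALE_MASK = 0xFFF;           // bits 11-0 hold the locale id

  // Compose an ICU collation id from a locale id and two sensitivity flags.
  static int icuId(int localeId, boolean caseInsensitive, boolean accentInsensitive) {
    if ((localeId & ~LOCALE_MASK) != 0) {
      throw new IllegalArgumentException("locale id must fit in 12 bits: " + localeId);
    }
    int id = ICU_FLAG | localeId;
    if (caseInsensitive) id |= CASE_INSENSITIVE;
    if (accentInsensitive) id |= ACCENT_INSENSITIVE;
    return id;
  }

  // Classify an id by looking at bits 31, 30 and 29.
  static String classify(int id) {
    if (id == INDETERMINATE_ID) return "INDETERMINATE";
    int top3 = id >>> 29;                            // unsigned shift keeps bits 31-29 only
    if (top3 == 2 || top3 == 3) return "USER_DEFINED"; // bit 31 = 0, bit 30 = 1
    if (top3 == 1) return "ICU";                     // bits 31-30 = 0, bit 29 = 1
    if (top3 == 0) return "UTF8_BINARY";             // bits 31-29 all zero
    return "UNKNOWN";                                // layout not described in the comment
  }

  // Extract the locale id (bits 11-0) from an ICU collation id.
  static int localeId(int id) {
    return id & LOCALE_MASK;
  }
}
```

Under these assumptions, `CollationIdSketch.icuId(1, true, true)` reproduces the `af_CI_AI -> 0x20030001` example, and `CollationIdSketch.classify(1)` treats `UTF8_BINARY_LCASE` as part of the UTF8_BINARY class.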



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

