mkaravel commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1605449462


##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version

Review Comment:
   Is there a good reason why the version bits sit at different positions in the
   UTF8_BINARY family (bits 17-16) and in ICU collations (bits 23-22)?
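
To make the question concrete, here is a minimal sketch of how a reader of the id would have to locate the version field under the two layouts, assuming the bit positions listed in the Javadoc above. The class `VersionBitsSketch` and its helpers are hypothetical names for illustration only and are not part of the PR.

```java
// Hypothetical illustration only; not code from the PR.
final class VersionBitsSketch {
  // Bit positions copied from the Javadoc layouts in this diff.
  static final int UTF8_BINARY_VERSION_OFFSET = 16; // bits 17-16
  static final int ICU_VERSION_OFFSET = 22;         // bits 23-22
  static final int VERSION_MASK = 0b11;             // two bits wide in both layouts
  static final int ICU_PROVIDER_BIT = 1 << 29;      // bit 29 set means ICU

  // Extracts a field by shifting it down to bit 0 and masking off the rest.
  static int getField(int collationId, int offset, int mask) {
    return (collationId >>> offset) & mask;
  }

  // Callers must branch on the provider before they can read the version.
  static int getVersion(int collationId) {
    boolean isIcu = (collationId & ICU_PROVIDER_BIT) != 0;
    int offset = isIcu ? ICU_VERSION_OFFSET : UTF8_BINARY_VERSION_OFFSET;
    return getField(collationId, offset, VERSION_MASK);
  }

  public static void main(String[] args) {
    int icuWithVersion2 = ICU_PROVIDER_BIT | (2 << ICU_VERSION_OFFSET);
    int binaryWithVersion2 = 2 << UTF8_BINARY_VERSION_OFFSET;
    System.out.println(getVersion(icuWithVersion2));
    System.out.println(getVersion(binaryWithVersion2));
  }
}
```

With a shared offset the branch in `getVersion` would not be needed. Under the current layout, bits 17-16 of an ICU id hold case and accent sensitivity, so the two fields cannot be read at the same position.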



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {

Review Comment:
   I do not want to block this PR over this, but I find the lack of comments on
   variables, inner classes, methods, etc. quite problematic. Right now the only
   way to understand what is going on is to read the code. That can be okay in
   some cases, but it does not help in understanding how the different pieces
   fit together.
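
As an illustration of the kind of documentation meant here, the origin and provider specifiers could be commented roughly as follows. This is only a sketch: the comment wording, the class name `DocumentedSpecSketch`, and the body of `getSpecValue` are assumptions, since `SpecifierUtils` is not shown in this hunk and its actual behavior may differ.

```java
// Hypothetical sketch; constant names mirror the PR, everything else is assumed.
final class DocumentedSpecSketch {
  /** Bit 30 tells whether the collation is predefined (0) or user-defined (1). */
  static final int DEFINITION_ORIGIN_OFFSET = 30;
  /** The definition origin occupies a single bit. */
  static final int DEFINITION_ORIGIN_MASK = 0b1;

  /** Bit 29 selects the implementation provider: UTF8_BINARY (0) or ICU (1). */
  static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
  /** The implementation provider occupies a single bit. */
  static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;

  /**
   * Returns the specifier stored in {@code collationId} at bit {@code offset}.
   * The value is shifted down to bit 0 and masked, so the result can be used
   * directly as an enum ordinal.
   */
  static int getSpecValue(int collationId, int offset, int mask) {
    return (collationId >>> offset) & mask;
  }

  public static void main(String[] args) {
    // UNICODE is 0x20000000 per the Javadoc examples, so its provider bit is 1 (ICU).
    System.out.println(getSpecValue(0x20000000,
      IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK));
  }
}
```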



##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -152,4 +219,218 @@ class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ig
       }
     })
   }
+
+  test("test collation caching") {
+    Seq(
+      "UTF8_BINARY",
+      "UTF8_BINARY_LCASE",
+      "UNICODE",
+      "UNICODE_CI",
+      "UNICODE_AI",
+      "UNICODE_CI_AI",
+      "UNICODE_AI_CI"
+    ).foreach(collationId => {
+      val col1 = fetchCollation(collationId)
+      val col2 = fetchCollation(collationId)
+      assert(col1 eq col2) // reference equality
+    })
+  }
+
+  test("collations with ICU non-root localization") {
+    Seq(
+      // language only
+      "en",
+      "en_CS",
+      "en_CI",
+      "en_AS",
+      "en_AI",
+      // language + 3-letter country code
+      "en_USA",
+      "en_USA_CS",
+      "en_USA_CI",
+      "en_USA_AS",
+      "en_USA_AI",
+      // language + script code
+      "sr_Cyrl",
+      "sr_Cyrl_CS",
+      "sr_Cyrl_CI",
+      "sr_Cyrl_AS",
+      "sr_Cyrl_AI",
+      // language + script code + 3-letter country code
+      "sr_Cyrl_SRB",
+      "sr_Cyrl_SRB_CS",
+      "sr_Cyrl_SRB_CI",
+      "sr_Cyrl_SRB_AS",
+      "sr_Cyrl_SRB_AI"
+    ).foreach(collationICU => {
+      val col = fetchCollation(collationICU)
+      assert(col.collator.getLocale(ULocale.VALID_LOCALE) != ULocale.ROOT)
+    })
+  }
+
+  test("invalid names of collations with ICU non-root localization") {

Review Comment:
   Same ask here for conflicting specifiers.
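
For reference, "conflicting specifiers" can be read as names that carry both members of a mutually exclusive pair, such as `en_CS_CI` or `en_USA_AS_AI`. A minimal sketch of that notion follows; `ConflictingSpecifiersSketch` is a hypothetical helper for illustration and is not how the PR parses names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the PR or the test suite.
final class ConflictingSpecifiersSketch {
  // Returns true when the name carries both members of a mutually exclusive
  // pair (CS with CI, or AS with AI) after the leading language token.
  static boolean hasConflictingSpecifiers(String collationName) {
    String[] tokens = collationName.toUpperCase(Locale.ROOT).split("_");
    if (tokens.length < 2) {
      return false; // no specifiers at all
    }
    // Token 0 is the language (or "UNICODE") and is skipped, because a
    // two-letter language code could coincide with a specifier such as "AS".
    List<String> rest = Arrays.asList(tokens).subList(1, tokens.length);
    boolean caseConflict = rest.contains("CS") && rest.contains("CI");
    boolean accentConflict = rest.contains("AS") && rest.contains("AI");
    return caseConflict || accentConflict;
  }

  public static void main(String[] args) {
    for (String name : new String[] {"en_CS_CI", "en_USA_AS_AI", "sr_Cyrl_SRB_CI_AI"}) {
      System.out.println(name + " -> " + hasConflictingSpecifiers(name));
    }
  }
}
```

Names flagged by such a check would be natural additions to the list of invalid names in this test.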



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int collationId) {
+        return ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if (UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, "1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout

Review Comment:
   ```suggestion
           // Maximum number of locale IDs as defined by binary layout.
   ```
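
For context on the bound this comment refers to: the ICU layout in the Javadoc reserves bits 11-0 for the locale id, so at most 2^12 = 4096 locale ids fit. A minimal sketch of that arithmetic follows; `LocaleIdCapacitySketch` and its members are hypothetical names, not identifiers from the PR.

```java
// Hypothetical illustration only; not code from the PR.
final class LocaleIdCapacitySketch {
  static final int LOCALE_ID_BITS = 12;                   // bits 11-0 in the ICU layout
  static final int MAX_LOCALE_IDS = 1 << LOCALE_ID_BITS;  // 4096 distinct ids
  static final int LOCALE_ID_MASK = MAX_LOCALE_IDS - 1;   // 0xFFF

  // True when every locale can be given an id that fits in the locale field.
  static boolean fits(int localeCount) {
    return localeCount <= MAX_LOCALE_IDS;
  }

  // Reads the locale id back out of a full collation id.
  static int localeId(int collationId) {
    return collationId & LOCALE_ID_MASK;
  }

  public static void main(String[] args) {
    // af_CI_AI is 0x20030001 per the Javadoc examples, so its locale id is 1.
    System.out.println(localeId(0x20030001));
  }
}
```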



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int collationId) {
+        return ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if (UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, "1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout
+        assert (ICULocaleNames.length <= (1 << 12));
+        for (int i = 0; i < ICULocaleNames.length; ++i) {
+          ICULocaleToId.put(ICULocaleNames[i], i);
+        }
+      }
+
+      private static final int UNICODE_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CS, AccentSensitivity.AS).collationId;
+      private static final int UNICODE_CI_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CI, AccentSensitivity.AS).collationId;
+
+      private final CaseSensitivity caseSensitivity;
+      private final AccentSensitivity accentSensitivity;
+      private final String locale;
+      private final int collationId;
+
+      private CollationSpecICU(String locale, CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        int collationId = ICULocaleToId.get(locale);
+        collationId = SpecifierUtils.setSpecValue(collationId, IMPLEMENTATION_PROVIDER_OFFSET,
+          ImplementationProvider.ICU);
+        collationId = SpecifierUtils.setSpecValue(collationId, CASE_SENSITIVITY_OFFSET,
+          caseSensitivity);
+        collationId = SpecifierUtils.setSpecValue(collationId, ACCENT_SENSITIVITY_OFFSET,
+          accentSensitivity);
+        this.collationId = collationId;
+      }
+
+      private static int collationNameToId(
+          String originalName, String collationName) throws SparkException {
+        // search for the longest locale match because specifiers are designed to be different from
+        // script tag and country code, meaning the only valid locale name match can be
+        // the longest one
+        int lastPos = -1;
+        for (int i = 1; i <= collationName.length(); i++) {
+          String localeName = collationName.substring(0, i);
+          if (ICULocaleMapUppercase.containsKey(localeName)) {
+            lastPos = i;
+          }
+        }
+        if (lastPos == -1) {
+          throw collationInvalidNameException(originalName);
+        } else {
+          String locale = collationName.substring(0, lastPos);
+          int collationId = ICULocaleToId.get(ICULocaleMapUppercase.get(locale));
+
+          // try all combinations of AS/AI and CS/CI
+          CaseSensitivity caseSensitivity;
+          AccentSensitivity accentSensitivity;
+          if (collationName.equals(locale) ||
+              collationName.equals(locale + "_AS") ||
+              collationName.equals(locale + "_CS") ||
+              collationName.equals(locale + "_AS_CS") ||
+              collationName.equals(locale + "_CS_AS")
+          ) {
+            caseSensitivity = CaseSensitivity.CS;
+            accentSensitivity = AccentSensitivity.AS;
+          } else if (collationName.equals(locale + "_CI") ||
+              collationName.equals(locale + "_AS_CI") ||
+              collationName.equals(locale + "_CI_AS")) {
+            caseSensitivity = CaseSensitivity.CI;
+            accentSensitivity = AccentSensitivity.AS;
+          } else if (collationName.equals(locale + "_AI") ||
+              collationName.equals(locale + "_CS_AI") ||
+              collationName.equals(locale + "_AI_CS")) {
+            caseSensitivity = CaseSensitivity.CS;
+            accentSensitivity = AccentSensitivity.AI;
+          } else if (collationName.equals(locale + "_AI_CI") ||
+              collationName.equals(locale + "_CI_AI")) {
+            caseSensitivity = CaseSensitivity.CI;
+            accentSensitivity = AccentSensitivity.AI;

Review Comment:
   When we have trimming, do we expect a combinatorial blow-up in the names we have to test for equality, or do we plan to follow a substring search approach?
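
To illustrate the alternative raised here, the following is a minimal Java sketch that parses the suffix one token at a time instead of comparing against every accepted ordering. It is not code from the PR: the class `SpecifierParserSketch` and the method `parseSpecifiers` are hypothetical names, and only the two bit offsets come from the ICU layout in the quoted Javadoc. The input is assumed to be the uppercased remainder of the name after the locale prefix.

```java
/** Hypothetical sketch: parse the specifier suffix token by token. */
final class SpecifierParserSketch {
  // Bit offsets copied from the ICU layout in the quoted Javadoc.
  static final int CASE_SENSITIVITY_OFFSET = 17;
  static final int ACCENT_SENSITIVITY_OFFSET = 16;

  /**
   * Parses an uppercased suffix such as "_CI_AI" into specifier bits.
   * Throws if a token is unknown or a specifier class appears twice.
   */
  static int parseSpecifiers(String suffix) {
    if (suffix.isEmpty()) {
      return 0; // no specifiers: case-sensitive, accent-sensitive
    }
    if (suffix.charAt(0) != '_') {
      throw new IllegalArgumentException("Malformed suffix: " + suffix);
    }
    boolean caseSeen = false;
    boolean accentSeen = false;
    int bits = 0;
    // Limit -1 keeps empty tokens, so "_CI_" and "__CI" are rejected below.
    for (String token : suffix.substring(1).split("_", -1)) {
      switch (token) {
        case "CS":
        case "CI":
          if (caseSeen) {
            throw new IllegalArgumentException("Duplicate case specifier: " + suffix);
          }
          caseSeen = true;
          if (token.equals("CI")) {
            bits |= 1 << CASE_SENSITIVITY_OFFSET;
          }
          break;
        case "AS":
        case "AI":
          if (accentSeen) {
            throw new IllegalArgumentException("Duplicate accent specifier: " + suffix);
          }
          accentSeen = true;
          if (token.equals("AI")) {
            bits |= 1 << ACCENT_SENSITIVITY_OFFSET;
          }
          break;
        default:
          throw new IllegalArgumentException("Unknown specifier: " + token);
      }
    }
    return bits;
  }

  public static void main(String[] args) {
    // Both orderings yield the same bits; prints 30000 twice.
    System.out.println(Integer.toHexString(parseSpecifiers("_CI_AI")));
    System.out.println(Integer.toHexString(parseSpecifiers("_AI_CI")));
  }
}
```

With this shape, a further specifier class (trimming, for instance) would add one more `case` group and one more "seen" flag, so the accepted orderings would not need to be listed out.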



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int collationId) {
+        return ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if (UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, "1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively

Review Comment:
   ```suggestion
             // Locale names are unique case-insensitively.
   ```



##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -30,31 +33,95 @@ import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
 
 class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite
   test("collationId stability") {
-    val utf8Binary = fetchCollation(0)
+    assert(INDETERMINATE_COLLATION_ID == -1)
+
+    assert(UTF8_BINARY_COLLATION_ID == 0)
+    val utf8Binary = fetchCollation(UTF8_BINARY_COLLATION_ID)
     assert(utf8Binary.collationName == "UTF8_BINARY")
     assert(utf8Binary.supportsBinaryEquality)
 
-    val utf8BinaryLcase = fetchCollation(1)
+    assert(UTF8_BINARY_LCASE_COLLATION_ID == 1)
+    val utf8BinaryLcase = fetchCollation(UTF8_BINARY_LCASE_COLLATION_ID)
     assert(utf8BinaryLcase.collationName == "UTF8_BINARY_LCASE")
     assert(!utf8BinaryLcase.supportsBinaryEquality)
 
-    val unicode = fetchCollation(2)
+    assert(UNICODE_COLLATION_ID == (1 << 29))
+    val unicode = fetchCollation(UNICODE_COLLATION_ID)
     assert(unicode.collationName == "UNICODE")
-    assert(unicode.supportsBinaryEquality);
+    assert(unicode.supportsBinaryEquality)
 
-    val unicodeCi = fetchCollation(3)
+    assert(UNICODE_CI_COLLATION_ID == ((1 << 29) | (1 << 17)))
+    val unicodeCi = fetchCollation(UNICODE_CI_COLLATION_ID)
     assert(unicodeCi.collationName == "UNICODE_CI")
     assert(!unicodeCi.supportsBinaryEquality)
   }
 
-  test("fetch invalid collation name") {
-    val error = intercept[SparkException] {
-      fetchCollation("UTF8_BS")
+  test("UTF8_BINARY and ICU root locale collation names") {
+    // collation name already normalized
+    Seq(
+      "UTF8_BINARY",
+      "UTF8_BINARY_LCASE",
+      "UNICODE",
+      "UNICODE_CI",
+      "UNICODE_AI",
+      "UNICODE_CI_AI"
+    ).foreach(collationName => {
+      val col = fetchCollation(collationName)
+      assert(col.collationName == collationName)
+    })
+    // collation name normalization
+    Seq(
+      // ICU root locale
+      ("UNICODE_CS", "UNICODE"),
+      ("UNICODE_CS_AS", "UNICODE"),
+      ("UNICODE_CI_AS", "UNICODE_CI"),
+      ("UNICODE_AI_CS", "UNICODE_AI"),
+      ("UNICODE_AI_CI", "UNICODE_CI_AI"),
+      // randomized case collation names
+      ("utf8_binary", "UTF8_BINARY"),
+      ("UtF8_binARy_LcasE", "UTF8_BINARY_LCASE"),
+      ("unicode", "UNICODE"),
+      ("UnICoDe_cs_aI", "UNICODE_AI")
+    ).foreach{
+      case (name, normalized) =>
+        val col = fetchCollation(name)
+        assert(col.collationName == normalized)
     }
+  }
+
+  test("fetch invalid UTF8_BINARY and ICU root locale collation names") {

Review Comment:
   Can we add test cases where we specify conflicting specifiers? For example 
"UNICODE_CI_CS" or "UNICODE_AS_AI".

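As a sketch of what such tests would exercise, the Java snippet below re-creates the suffix equality chain from the quoted `CollationSpecICU.collationNameToId` hunk in isolation. The class `ConflictingSpecifierSketch` and the method `classify` are hypothetical names. The quoted hunk is cut off before its last branch, so returning `null` for unmatched names is an assumption that stands in for whatever the PR does there (presumably throwing `COLLATION_INVALID_NAME`).

```java
/** Hypothetical re-creation of the suffix equality chain from the quoted hunk. */
final class ConflictingSpecifierSketch {
  /**
   * Returns a normalized sensitivity pair ("CS_AS", "CI_AS", "CS_AI", "CI_AI")
   * for an uppercased collation name, or null when no branch matches.
   */
  static String classify(String collationName, String locale) {
    if (collationName.equals(locale)
        || collationName.equals(locale + "_AS")
        || collationName.equals(locale + "_CS")
        || collationName.equals(locale + "_AS_CS")
        || collationName.equals(locale + "_CS_AS")) {
      return "CS_AS";
    } else if (collationName.equals(locale + "_CI")
        || collationName.equals(locale + "_AS_CI")
        || collationName.equals(locale + "_CI_AS")) {
      return "CI_AS";
    } else if (collationName.equals(locale + "_AI")
        || collationName.equals(locale + "_CS_AI")
        || collationName.equals(locale + "_AI_CS")) {
      return "CS_AI";
    } else if (collationName.equals(locale + "_AI_CI")
        || collationName.equals(locale + "_CI_AI")) {
      return "CI_AI";
    }
    // Assumption: the branch not shown in the hunk rejects the name.
    return null;
  }

  public static void main(String[] args) {
    // Conflicting or repeated specifiers match no branch.
    for (String name : new String[] {"UNICODE_CI_CS", "UNICODE_AS_AI", "UNICODE_CS_CS"}) {
      System.out.println(name + " -> " + classify(name, "UNICODE"));
    }
  }
}
```

Because the chain only lists compatible pairs, names such as `UNICODE_CI_CS` and `UNICODE_AS_AI` match no branch in this sketch, which is the behavior the requested tests would pin down.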


##########
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##########
@@ -152,4 +219,218 @@ class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ig
       }
     })
   }
+
+  test("test collation caching") {
+    Seq(
+      "UTF8_BINARY",
+      "UTF8_BINARY_LCASE",
+      "UNICODE",
+      "UNICODE_CI",
+      "UNICODE_AI",
+      "UNICODE_CI_AI",
+      "UNICODE_AI_CI"
+    ).foreach(collationId => {
+      val col1 = fetchCollation(collationId)
+      val col2 = fetchCollation(collationId)
+      assert(col1 eq col2) // reference equality
+    })
+  }
+
+  test("collations with ICU non-root localization") {
+    Seq(
+      // language only
+      "en",
+      "en_CS",
+      "en_CI",
+      "en_AS",
+      "en_AI",
+      // language + 3-letter country code
+      "en_USA",
+      "en_USA_CS",
+      "en_USA_CI",
+      "en_USA_AS",
+      "en_USA_AI",
+      // language + script code
+      "sr_Cyrl",
+      "sr_Cyrl_CS",
+      "sr_Cyrl_CI",
+      "sr_Cyrl_AS",
+      "sr_Cyrl_AI",
+      // language + script code + 3-letter country code
+      "sr_Cyrl_SRB",
+      "sr_Cyrl_SRB_CS",
+      "sr_Cyrl_SRB_CI",
+      "sr_Cyrl_SRB_AS",
+      "sr_Cyrl_SRB_AI"
+    ).foreach(collationICU => {
+      val col = fetchCollation(collationICU)
+      assert(col.collator.getLocale(ULocale.VALID_LOCALE) != ULocale.ROOT)
+    })
+  }
+
+  test("invalid names of collations with ICU non-root localization") {

Review Comment:
   Can we also add test cases where the script or the country code is not in the right position?
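
To show how misplaced components interact with the longest-prefix search in the quoted `collationNameToId` loop, here is a small Java sketch. The locale set is a hand-picked, hypothetical stand-in for the keys of `ICULocaleMapUppercase`; the real set comes from `Collator.getAvailableULocales()` and may differ. `LongestLocalePrefixSketch` and its methods are illustrative names only.

```java
import java.util.Set;

/** Hypothetical sketch of the longest-locale-prefix search from the quoted hunk. */
final class LongestLocalePrefixSketch {
  // Assumption: a tiny stand-in for the uppercased ICU locale names.
  static final Set<String> LOCALES =
      Set.of("EN", "EN_USA", "SR", "SR_CYRL", "SR_CYRL_SRB");

  /** Returns the length of the longest known locale prefix, or -1 if none matches. */
  static int longestLocalePrefix(String upperName) {
    int lastPos = -1;
    for (int i = 1; i <= upperName.length(); i++) {
      if (LOCALES.contains(upperName.substring(0, i))) {
        lastPos = i;
      }
    }
    return lastPos;
  }

  /** Returns what is left after the locale prefix, or null if no locale matches. */
  static String remainder(String upperName) {
    int pos = longestLocalePrefix(upperName);
    return pos == -1 ? null : upperName.substring(pos);
  }

  public static void main(String[] args) {
    // Well-formed name: the remainder is a plain specifier suffix.
    System.out.println(remainder("SR_CYRL_SRB_CI"));
    // Country before script: only "SR" matches, the rest is not a specifier.
    System.out.println(remainder("SR_SRB_CYRL"));
    // Script or country first: no locale prefix matches at all.
    System.out.println(remainder("CYRL_SR"));
    System.out.println(remainder("USA_EN"));
  }
}
```

Under these assumptions a misplaced country code leaves a remainder such as `_SRB_CYRL`, which no specifier branch accepts, while a name that starts with the script or country matches no locale at all. Both paths seem worth covering in the invalid-name tests.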



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int collationId) {
+        return ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if (UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, "1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout
+        assert (ICULocaleNames.length <= (1 << 12));
+        for (int i = 0; i < ICULocaleNames.length; ++i) {
+          ICULocaleToId.put(ICULocaleNames[i], i);
+        }
+      }
+
+      private static final int UNICODE_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CS, 
AccentSensitivity.AS).collationId;
+      private static final int UNICODE_CI_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CI, 
AccentSensitivity.AS).collationId;
+
+      private final CaseSensitivity caseSensitivity;
+      private final AccentSensitivity accentSensitivity;
+      private final String locale;
+      private final int collationId;
+
+      private CollationSpecICU(String locale, CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        int collationId = ICULocaleToId.get(locale);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
IMPLEMENTATION_PROVIDER_OFFSET,
+          ImplementationProvider.ICU);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
CASE_SENSITIVITY_OFFSET,
+          caseSensitivity);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
ACCENT_SENSITIVITY_OFFSET,
+          accentSensitivity);
+        this.collationId = collationId;
+      }
+
+      private static int collationNameToId(
+          String originalName, String collationName) throws SparkException {
+        // search for the longest locale match because specifiers are designed 
to be different from
+        // script tag and country code, meaning the only valid locale name 
match can be
+        // the longest one
+        int lastPos = -1;
+        for (int i = 1; i <= collationName.length(); i++) {
+          String localeName = collationName.substring(0, i);
+          if (ICULocaleMapUppercase.containsKey(localeName)) {
+            lastPos = i;
+          }
+        }
+        if (lastPos == -1) {
+          throw collationInvalidNameException(originalName);
+        } else {
+          String locale = collationName.substring(0, lastPos);
+          int collationId = 
ICULocaleToId.get(ICULocaleMapUppercase.get(locale));
+
+          // try all combinations of AS/AI and CS/CI

Review Comment:
   ```suggestion
             // Try all combinations of AS/AI and CS/CI.
   ```



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits 
(bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new 
ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int 
collationId) {
+        return 
ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return 
DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = 
getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String 
collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", 
collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws 
SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, 
collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, 
collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new 
HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, 
false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary 
strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, 
false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new 
CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, 
caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String 
collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if 
(UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new 
CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, 
"1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, 
UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in 
this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to 
uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new 
HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new 
HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout
+        assert (ICULocaleNames.length <= (1 << 12));
+        for (int i = 0; i < ICULocaleNames.length; ++i) {
+          ICULocaleToId.put(ICULocaleNames[i], i);
+        }
+      }
+
+      private static final int UNICODE_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CS, 
AccentSensitivity.AS).collationId;
+      private static final int UNICODE_CI_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CI, 
AccentSensitivity.AS).collationId;
+
+      private final CaseSensitivity caseSensitivity;
+      private final AccentSensitivity accentSensitivity;
+      private final String locale;
+      private final int collationId;
+
+      private CollationSpecICU(String locale, CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        int collationId = ICULocaleToId.get(locale);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
IMPLEMENTATION_PROVIDER_OFFSET,
+          ImplementationProvider.ICU);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
CASE_SENSITIVITY_OFFSET,
+          caseSensitivity);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
ACCENT_SENSITIVITY_OFFSET,
+          accentSensitivity);
+        this.collationId = collationId;
+      }
+
+      private static int collationNameToId(
+          String originalName, String collationName) throws SparkException {
+        // search for the longest locale match because specifiers are designed 
to be different from
+        // script tag and country code, meaning the only valid locale name 
match can be
+        // the longest one
+        int lastPos = -1;
+        for (int i = 1; i <= collationName.length(); i++) {
+          String localeName = collationName.substring(0, i);
+          if (ICULocaleMapUppercase.containsKey(localeName)) {
+            lastPos = i;
+          }
+        }
+        if (lastPos == -1) {
+          throw collationInvalidNameException(originalName);
+        } else {
+          String locale = collationName.substring(0, lastPos);
+          int collationId = 
ICULocaleToId.get(ICULocaleMapUppercase.get(locale));
+
+          // try all combinations of AS/AI and CS/CI
+          CaseSensitivity caseSensitivity;
+          AccentSensitivity accentSensitivity;
+          if (collationName.equals(locale) ||
+              collationName.equals(locale + "_AS") ||

Review Comment:
   Is `collationName` expected to be uppercase?
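
   For context on why this matters: the comparisons here build the expected
   name from the matched locale plus an uppercase specifier such as `_AS`, so
   they can only succeed if the caller has already uppercased the name. A
   minimal sketch of that, assuming the caller normalizes once up front
   (`isCaseInsensitive` and the class name are hypothetical and not part of
   this PR; `Locale.ROOT` is my choice, the PR calls `toUpperCase()` without a
   locale):

   ```java
   import java.util.Locale;

   public class UppercaseNameSketch {
     // Hypothetical helper: true if `name` is `locale` followed by the "_CI" specifier.
     static boolean isCaseInsensitive(String name, String locale) {
       return name.equals(locale + "_CI");
     }

     public static void main(String[] args) {
       String userInput = "unicode_ci";
       // Normalize once, before any specifier comparison.
       String normalized = userInput.toUpperCase(Locale.ROOT);
       System.out.println(isCaseInsensitive(userInput, "UNICODE"));   // false
       System.out.println(isCaseInsensitive(normalized, "UNICODE"));  // true
     }
   }
   ```

   If uppercase input is indeed a precondition, stating it in the method
   comment (or renaming the parameter) would make that explicit.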



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits 
(bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new 
ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int 
collationId) {
+        return 
ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return 
DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = 
getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String 
collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", 
collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws 
SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, 
collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, 
collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new 
HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, 
false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary 
strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, 
false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new 
CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, 
caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String 
collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if 
(UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new 
CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, 
"1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, 
UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in 
this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to 
uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new 
HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new 
HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout
+        assert (ICULocaleNames.length <= (1 << 12));
+        for (int i = 0; i < ICULocaleNames.length; ++i) {
+          ICULocaleToId.put(ICULocaleNames[i], i);
+        }
+      }
+
+      private static final int UNICODE_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CS, 
AccentSensitivity.AS).collationId;
+      private static final int UNICODE_CI_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CI, 
AccentSensitivity.AS).collationId;
+
+      private final CaseSensitivity caseSensitivity;
+      private final AccentSensitivity accentSensitivity;
+      private final String locale;
+      private final int collationId;
+
+      private CollationSpecICU(String locale, CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        int collationId = ICULocaleToId.get(locale);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
IMPLEMENTATION_PROVIDER_OFFSET,
+          ImplementationProvider.ICU);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
CASE_SENSITIVITY_OFFSET,
+          caseSensitivity);
+        collationId = SpecifierUtils.setSpecValue(collationId, 
ACCENT_SENSITIVITY_OFFSET,
+          accentSensitivity);
+        this.collationId = collationId;
+      }
+
+      private static int collationNameToId(
+          String originalName, String collationName) throws SparkException {
+        // search for the longest locale match because specifiers are designed to be different from
+        // script tag and country code, meaning the only valid locale name match can be
+        // the longest one

Review Comment:
   ```suggestion
           // Search for the longest locale match because specifiers are designed to be different from
           // script tag and country code, meaning the only valid locale name match can be
           // the longest one.
   ```
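
For reference, the longest-match rule described in that comment can be sketched in isolation. This is only an illustrative sketch: the class name, the `LOCALES` map, and its three entries are hypothetical stand-ins for `ICULocaleMapUppercase`, not the actual contents of the map in the PR.

```java
import java.util.Map;

class LongestLocaleMatch {
  // Hypothetical stand-in for ICULocaleMapUppercase (uppercase name -> canonical name).
  static final Map<String, String> LOCALES = Map.of(
      "SR", "sr",
      "SR_CYRL", "sr_Cyrl",
      "SR_CYRL_SRB", "sr_Cyrl_SRB");

  // Returns the length of the longest prefix of `name` that is a known locale,
  // or -1 if no prefix matches.
  static int longestLocalePrefix(String name) {
    int lastPos = -1;
    for (int i = 1; i <= name.length(); i++) {
      if (LOCALES.containsKey(name.substring(0, i))) {
        lastPos = i; // Keep going: a longer prefix may also be a locale.
      }
    }
    return lastPos;
  }

  public static void main(String[] args) {
    String name = "SR_CYRL_SRB_CI";
    int pos = longestLocalePrefix(name);
    System.out.println(name.substring(0, pos)); // Locale part of the name.
    System.out.println(name.substring(pos));    // Remaining specifiers.
  }
}
```

With these sample entries, `SR` and `SR_CYRL` also match as prefixes, which is why the scan must not stop at the first hit.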



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version

Review Comment:
   I think it would be helpful (as part of a follow-up PR) to provide a 
bit-centric view (right now we have a collation-centric view).
   Bit-centric view:
   ```
   31: 1 for INDETERMINATE (requires all other bits to be 1 as well), 0 for all other collations.
   30: 0 for predefined, 1 for user-defined
   29-24: Reserved
   23-22: Reserved for version (here we have a discrepancy between UTF8_BINARY and ICU family).
   21-18: Reserved for space trimming
   17-0: Depend on collation family.
   ```
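
To make the bit-centric reading concrete, here is a minimal decoding sketch. It follows the layout in the quoted Javadoc (where bit 29 selects the ICU family, bits 17 and 16 carry case and accent sensitivity, and bits 11-0 carry the locale id) rather than the list above; the class and method names are hypothetical, and it does not validate reserved bits.

```java
class CollationIdBits {
  // Reads the field selected by `mask` that starts at bit `offset`.
  static int field(int id, int offset, int mask) {
    return (id >>> offset) & mask;
  }

  // Describes a collation id by inspecting bits from most to least significant.
  static String describe(int id) {
    if (field(id, 31, 0b1) == 1) {
      // Bit 31 is only set for INDETERMINATE, which requires all bits to be 1.
      return id == -1 ? "INDETERMINATE" : "invalid";
    }
    if (field(id, 30, 0b1) == 1) {
      return "user-defined";
    }
    if (field(id, 29, 0b1) == 0) {
      // UTF8_BINARY family: bit 0 selects the lowercase variant.
      return field(id, 0, 0b1) == 1 ? "UTF8_BINARY_LCASE" : "UTF8_BINARY";
    }
    String cs = field(id, 17, 0b1) == 1 ? "CI" : "CS";
    String as = field(id, 16, 0b1) == 1 ? "AI" : "AS";
    return "ICU locale=" + field(id, 0, 0xFFF) + " " + cs + " " + as;
  }

  public static void main(String[] args) {
    System.out.println(describe(0));          // UTF8_BINARY
    System.out.println(describe(1));          // UTF8_BINARY_LCASE
    System.out.println(describe(0x20020000)); // ICU locale=0 CI AS
    System.out.println(describe(0x20030001)); // ICU locale=1 CI AI
    System.out.println(describe(-1));         // INDETERMINATE
  }
}
```

The sample ids are the ones listed in the Javadoc examples (`UNICODE_CI` and `af_CI_AI`).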



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -118,76 +119,433 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * Collation id is defined as 32-bit integer.
+     * We specify binary layouts for different classes of collations.
+     * Classes of collations are differentiated by most significant 3 bits (bit 31, 30 and 29),
+     * bit 31 being most significant and bit 0 being least significant.
+     * ---
+     * INDETERMINATE collation id binary layout:
+     * bit 31-0: 1
+     * INDETERMINATE collation id is equal to -1
+     * ---
+     * user-defined collation id binary layout:
+     * bit 31:   0
+     * bit 30:   1
+     * bit 29-0: undefined, reserved for future use
+     * ---
+     * UTF8_BINARY collation id binary layout:
+     * bit 31-22: zeroes
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17-16: zeroes, reserved for version
+     * bit 15-3:  zeroes
+     * bit 2:     0, reserved for accent sensitivity
+     * bit 1:     0, reserved for uppercase and case-insensitive
+     * bit 0:     0 = case-sensitive, 1 = lowercase
+     * ---
+     * ICU collation id binary layout:
+     * bit 31-30: zeroes
+     * bit 29:    1
+     * bit 28-24: zeroes
+     * bit 23-22: zeroes, reserved for version
+     * bit 21-18: zeroes, reserved for space trimming
+     * bit 17:    0 = case-sensitive, 1 = case-insensitive
+     * bit 16:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 15-14: zeroes, reserved for punctuation sensitivity
+     * bit 13-12: zeroes, reserved for first letter preference
+     * bit 11-0:  locale id as specified in `ICULocaleToId` mapping
+     * ---
+     * Some illustrative examples of collation name to id mapping:
+     * - UTF8_BINARY       -> 0
+     * - UTF8_BINARY_LCASE -> 1
+     * - UNICODE           -> 0x20000000
+     * - UNICODE_AI        -> 0x20010000
+     * - UNICODE_CI        -> 0x20020000
+     * - UNICODE_CI_AI     -> 0x20030000
+     * - af                -> 0x20000001
+     * - af_CI_AI          -> 0x20030001
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+
+      private enum DefinitionOrigin {
+        PREDEFINED, USER_DEFINED
+      }
+
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private static final int DEFINITION_ORIGIN_OFFSET = 30;
+      private static final int DEFINITION_ORIGIN_MASK = 0b1;
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b1;
+
+      private static final int INDETERMINATE_COLLATION_ID = -1;
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      private static ImplementationProvider getImplementationProvider(int collationId) {
+        return ImplementationProvider.values()[SpecifierUtils.getSpecValue(collationId,
+          IMPLEMENTATION_PROVIDER_OFFSET, IMPLEMENTATION_PROVIDER_MASK)];
+      }
+
+      private static DefinitionOrigin getDefinitionOrigin(int collationId) {
+        return DefinitionOrigin.values()[SpecifierUtils.getSpecValue(collationId,
+          DEFINITION_ORIGIN_OFFSET, DEFINITION_ORIGIN_MASK)];
+      }
+
+      private static Collation fetchCollation(int collationId) {
+        assert (collationId >= 0 && getDefinitionOrigin(collationId)
+          == DefinitionOrigin.PREDEFINED);
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          ImplementationProvider implementationProvider = getImplementationProvider(collationId);
+          if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+            spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+          } else {
+            spec = CollationSpecICU.fromCollationId(collationId);
+          }
+          Collation collation = spec.buildCollation();
+          collationMap.put(collationId, collation);
+          return collation;
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      private static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();
+        if (collationNameUpper.startsWith("UTF8_BINARY")) {
+          return CollationSpecUTF8Binary.collationNameToId(collationName, collationNameUpper);
+        } else {
+          return CollationSpecICU.collationNameToId(collationName, collationNameUpper);
+        }
+      }
+
+      protected abstract Collation buildCollation();
     }
-  }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+    private static class CollationSpecUTF8Binary extends CollationSpec {
+
+      private static final int CASE_SENSITIVITY_OFFSET = 0;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+
+      private enum CaseSensitivity {
+        UNSPECIFIED, LCASE
+      }
+
+      private static final int UTF8_BINARY_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).collationId;
+      private static final int UTF8_BINARY_LCASE_COLLATION_ID =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).collationId;
+      protected static Collation UTF8_BINARY_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.UNSPECIFIED).buildCollation();
+      protected static Collation UTF8_BINARY_LCASE_COLLATION =
+        new CollationSpecUTF8Binary(CaseSensitivity.LCASE).buildCollation();
+
+      private final int collationId;
+
+      private CollationSpecUTF8Binary(CaseSensitivity caseSensitivity) {
+        this.collationId =
+          SpecifierUtils.setSpecValue(0, CASE_SENSITIVITY_OFFSET, caseSensitivity);
+      }
+
+      private static int collationNameToId(String originalName, String collationName)
+          throws SparkException {
+        if (UTF8_BINARY_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_COLLATION_ID;
+        } else if (UTF8_BINARY_LCASE_COLLATION.collationName.equals(collationName)) {
+          return UTF8_BINARY_LCASE_COLLATION_ID;
+        } else {
+          throw collationInvalidNameException(originalName);
+        }
+      }
+
+      private static CollationSpecUTF8Binary fromCollationId(int collationId) {
+        int caseConversionOrdinal = SpecifierUtils.getSpecValue(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK);
+        assert (SpecifierUtils.removeSpec(collationId,
+          CASE_SENSITIVITY_OFFSET, CASE_SENSITIVITY_MASK) == 0);
+        return new CollationSpecUTF8Binary(CaseSensitivity.values()[caseConversionOrdinal]);
+      }
+
+      @Override
+      protected Collation buildCollation() {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return new Collation("UTF8_BINARY", null, UTF8String::binaryCompare, "1.0",
+            s -> (long) s.hashCode(), true, true, false);
+        } else {
+          return new Collation("UTF8_BINARY_LCASE", null, UTF8String::compareLowerCase, "1.0",
+            s -> (long) s.toLowerCase().hashCode(), false, false, true);
+        }
+      }
+    }
+
+    private static class CollationSpecICU extends CollationSpec {
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private static final int CASE_SENSITIVITY_OFFSET = 17;
+      private static final int CASE_SENSITIVITY_MASK = 0b1;
+      private static final int ACCENT_SENSITIVITY_OFFSET = 16;
+      private static final int ACCENT_SENSITIVITY_MASK = 0b1;
+
+      // Array of locale names, each locale id corresponds to the index in this array
+      private static final String[] ICULocaleNames;
+
+      // Mapping of locale names to corresponding `ULocale` instance
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+
+      // Used to parse user input collation names which are converted to uppercase
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+
+      // Reverse mapping of `ICULocaleNames`
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      private static final String ICU_COLLATOR_VERSION = "153.120.0.0";
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();
+        for (ULocale locale : locales) {
+          if (locale.getVariant().isEmpty()) {
+            String language = locale.getLanguage();
+            assert (!language.isEmpty());
+            StringBuilder builder = new StringBuilder(language);
+            String script = locale.getScript();
+            if (!script.isEmpty()) {
+              builder.append('_');
+              builder.append(script);
+            }
+            String country = locale.getISO3Country();
+            if (!country.isEmpty()) {
+              builder.append('_');
+              builder.append(country);
+            }
+            String localeName = builder.toString();
+            // locale names are unique
+            assert (!ICULocaleMap.containsKey(localeName));
+            ICULocaleMap.put(localeName, locale);
+          }
+        }
+        for (String localeName : ICULocaleMap.keySet()) {
+          String localeUppercase = localeName.toUpperCase();
+          // locale names are unique case-insensitively
+          assert (!ICULocaleMapUppercase.containsKey(localeUppercase));
+          ICULocaleMapUppercase.put(localeUppercase, localeName);
+        }
+        ICULocaleNames = ICULocaleMap.keySet().toArray(new String[0]);
+        Arrays.sort(ICULocaleNames);
+        // maximum number of locale ids as defined by binary layout
+        assert (ICULocaleNames.length <= (1 << 12));
+        for (int i = 0; i < ICULocaleNames.length; ++i) {
+          ICULocaleToId.put(ICULocaleNames[i], i);
+        }
+      }
+
+      private static final int UNICODE_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CS, AccentSensitivity.AS).collationId;
+      private static final int UNICODE_CI_COLLATION_ID =
+        new CollationSpecICU("UNICODE", CaseSensitivity.CI, AccentSensitivity.AS).collationId;
+
+      private final CaseSensitivity caseSensitivity;
+      private final AccentSensitivity accentSensitivity;
+      private final String locale;
+      private final int collationId;
+
+      private CollationSpecICU(String locale, CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        int collationId = ICULocaleToId.get(locale);
+        collationId = SpecifierUtils.setSpecValue(collationId, IMPLEMENTATION_PROVIDER_OFFSET,
+          ImplementationProvider.ICU);
+        collationId = SpecifierUtils.setSpecValue(collationId, CASE_SENSITIVITY_OFFSET,
+          caseSensitivity);
+        collationId = SpecifierUtils.setSpecValue(collationId, ACCENT_SENSITIVITY_OFFSET,
+          accentSensitivity);
+        this.collationId = collationId;
+      }
+
+      private static int collationNameToId(
+          String originalName, String collationName) throws SparkException {
+        // search for the longest locale match because specifiers are designed to be different from
+        // script tag and country code, meaning the only valid locale name match can be
+        // the longest one
+        int lastPos = -1;
+        for (int i = 1; i <= collationName.length(); i++) {
+          String localeName = collationName.substring(0, i);
+          if (ICULocaleMapUppercase.containsKey(localeName)) {
+            lastPos = i;
+          }
+        }
+        if (lastPos == -1) {
+          throw collationInvalidNameException(originalName);
+        } else {
+          String locale = collationName.substring(0, lastPos);
+          int collationId = ICULocaleToId.get(ICULocaleMapUppercase.get(locale));
+
+          // try all combinations of AS/AI and CS/CI

Review Comment:
   Please use full/regular sentences for comments. I do not plan to comment on 
this in the rest of the PR.



##########
sql/core/src/test/scala/org/apache/spark/sql/ICUCollationsMap.scala:
##########
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.{fileToString, stringToFile, CollationFactory}
+
+// scalastyle:off line.size.limit
+/**
+ * Guard against breaking changes in ICU locale names and codes supported by Collator class and provider by CollationFactory.
+ * Map is in form of rows of pairs (locale name, locale id); locale name consists of three parts:
+ * - 2-letter lowercase language code
+ * - 4-letter script code (optional)
+ * - 3-letter uppercase country code
+ *
+ * To re-generate collations map golden file, run:
+ * {{{
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationsMap"
+ * }}}
+ */
+// scalastyle:on line.size.limit
+class ICUCollationsMap extends SparkFunSuite {

Review Comment:
   Just curious: since this is a test, shouldn't the name be like 
`ICUCollationsMapSuite`?



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -117,76 +118,490 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * collation id (32-bit integer) layout:
+     * bit 31:    0 = predefined collation, 1 = user-defined collation
+     * bit 30:    0 = utf8-binary, 1 = ICU
+     * bit 29:    0 = case-sensitive, 1 = case-insensitive
+     * bit 28:    0 = accent-sensitive, 1 = accent-insensitive
+     * bit 27-26: 00 = unspecified, 01 = punctuation-sensitive, 10 = punctuation-insensitive
+     * bit 25-24: 00 = unspecified, 01 = first-lower, 10 = first-upper
+     * bit 23-22: 00 = unspecified, 01 = to-lower, 10 = to-upper
+     * bit 21-20: 00 = unspecified, 01 = trim-left, 10 = trim-right, 11 = trim-both
+     * bit 19-18: zeroes, reserved for version
+     * bit 17-16: zeroes
+     * bit 15-0:  locale id for ICU collations / zeroes for utf8-binary
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
-    }
-  }
+    private static class CollationSpec {
+      private enum ImplementationProvider {
+        UTF8_BINARY, ICU
+      }
+
+      private enum CaseSensitivity {
+        CS, CI
+      }
+
+      private enum AccentSensitivity {
+        AS, AI
+      }
+
+      private enum PunctuationSensitivity {
+        UNSPECIFIED, PS, PI
+      }
 
-  private static final Collation[] collationTable = new Collation[4];
-  private static final HashMap<String, Integer> collationNameToIdMap = new HashMap<>();
-
-  public static final int UTF8_BINARY_COLLATION_ID = 0;
-  public static final int UTF8_BINARY_LCASE_COLLATION_ID = 1;
-
-  static {
-    // Binary comparison. This is the default collation.
-    // No custom comparators will be used for this collation.
-    // Instead, we rely on byte for byte comparison.
-    collationTable[0] = new Collation(
-      "UTF8_BINARY",
-      null,
-      UTF8String::binaryCompare,
-      "1.0",
-      s -> (long)s.hashCode(),
-      true,
-      true,
-      false);
-
-    // Case-insensitive UTF8 binary collation.
-    // TODO: Do in place comparisons instead of creating new strings.
-    collationTable[1] = new Collation(
-      "UTF8_BINARY_LCASE",
-      null,
-      UTF8String::compareLowerCase,
-      "1.0",
-      (s) -> (long)s.toLowerCase().hashCode(),
-      false,
-      false,
-      true);
-
-    // UNICODE case sensitive comparison (ROOT locale, in ICU).
-    collationTable[2] = new Collation(
-      "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true, false, false);
-    collationTable[2].collator.setStrength(Collator.TERTIARY);
-    collationTable[2].collator.freeze();
-
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
-    collationTable[3] = new Collation(
-      "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false, false, false);
-    collationTable[3].collator.setStrength(Collator.SECONDARY);
-    collationTable[3].collator.freeze();
-
-    for (int i = 0; i < collationTable.length; i++) {
-      collationNameToIdMap.put(collationTable[i].collationName, i);
+      private enum FirstLetterPreference {
+        UNSPECIFIED, FU, FL
+      }
+
+      private enum CaseConversion {
+        UNSPECIFIED, LCASE, UCASE
+      }
+
+      private enum SpaceTrimming {
+        UNSPECIFIED, LTRIM, RTRIM, TRIM
+      }
+
+      private static final int implementationProviderOffset = 30;
+      private static final int implementationProviderLen = 1;
+      private static final int caseSensitivityOffset = 29;
+      private static final int caseSensitivityLen = 1;
+      private static final int accentSensitivityOffset = 28;
+      private static final int accentSensitivityLen = 1;
+      private static final int punctuationSensitivityOffset = 26;
+      private static final int punctuationSensitivityLen = 2;
+      private static final int firstLetterPreferenceOffset = 24;
+      private static final int firstLetterPreferenceLen = 2;
+      private static final int caseConversionOffset = 22;
+      private static final int caseConversionLen = 2;
+      private static final int spaceTrimmingOffset = 20;
+      private static final int spaceTrimmingLen = 2;
+      private static final int localeOffset = 0;
+      private static final int localeLen = 16;
+
+      private static final String[] ICULocaleNames;
+      private static final Map<String, ULocale> ICULocaleMap = new HashMap<>();
+      private static final Map<String, String> ICULocaleMapUppercase = new HashMap<>();
+      private static final Map<String, Integer> ICULocaleToId = new HashMap<>();
+
+      static {
+        ICULocaleMap.put("UNICODE", ULocale.ROOT);
+        ULocale[] locales = Collator.getAvailableULocales();

Review Comment:
   I like the approach taken here. Thank you!



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##########
@@ -117,76 +119,445 @@ public Collation(
     }
 
     /**
-     * Constructor with comparators that are inherited from the given collator.
+     * collation id (32-bit integer) layout:
+     * bit 31:    0 = predefined collation, 1 = user-defined collation
+     * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec implementation)
+     * bit 28:    0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive for ICU
+     * bit 27:    0 for utf8-binary / 0 = accent-sensitive, 1 = accent-insensitive for ICU
+     * bit 26-25: zeroes, reserved for punctuation sensitivity
+     * bit 24-23: zeroes, reserved for first letter preference
+     * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper
+     * bit 20-19: zeroes, reserved for space trimming
+     * bit 18-17: zeroes, reserved for version
+     * bit 16-12: zeroes
+     * bit 11-0:  zeroes for utf8-binary / locale id for ICU
      */
-    public Collation(
-        String collationName,
-        Collator collator,
-        String version,
-        boolean supportsBinaryEquality,
-        boolean supportsBinaryOrdering,
-        boolean supportsLowercaseEquality) {
-      this(
-        collationName,
-        collator,
-        (s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-        version,
-        s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-        supportsBinaryEquality,
-        supportsBinaryOrdering,
-        supportsLowercaseEquality);
+    private abstract static class CollationSpec {
+      protected enum ImplementationProvider {
+        UTF8_BINARY, ICU, INDETERMINATE
+      }
+
+      protected enum CaseSensitivity {
+        CS, CI
+      }
+
+      protected enum AccentSensitivity {
+        AS, AI
+      }
+
+      protected enum CaseConversion {
+        UNSPECIFIED, LCASE, UCASE
+      }
+
+      protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+      protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11;
+      protected static final int CASE_SENSITIVITY_OFFSET = 28;
+      protected static final int CASE_SENSITIVITY_MASK = 0b1;
+      protected static final int ACCENT_SENSITIVITY_OFFSET = 27;
+      protected static final int ACCENT_SENSITIVITY_MASK = 0b1;
+      protected static final int CASE_CONVERSION_OFFSET = 21;
+      protected static final int CASE_CONVERSION_MASK = 0b11;
+      protected static final int LOCALE_OFFSET = 0;
+      protected static final int LOCALE_MASK = 0x0FFF;
+
+      protected static final int INDETERMINATE_COLLATION_ID =
+        ImplementationProvider.INDETERMINATE.ordinal() << IMPLEMENTATION_PROVIDER_OFFSET;
+
+      protected final CaseSensitivity caseSensitivity;
+      protected final AccentSensitivity accentSensitivity;
+      protected final CaseConversion caseConversion;
+      protected final String locale;
+      protected final int collationId;
+
+      protected CollationSpec(
+          String locale,
+          CaseSensitivity caseSensitivity,
+          AccentSensitivity accentSensitivity,
+          CaseConversion caseConversion) {
+        this.locale = locale;
+        this.caseSensitivity = caseSensitivity;
+        this.accentSensitivity = accentSensitivity;
+        this.caseConversion = caseConversion;
+        this.collationId = getCollationId();
+      }
+
+      private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();
+
+      public static Collation fetchCollation(int collationId) throws SparkException {
+        if (collationId == UTF8_BINARY_COLLATION_ID) {
+          return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+        } else if (collationMap.containsKey(collationId)) {
+          return collationMap.get(collationId);
+        } else {
+          CollationSpec spec;
+          int implementationProviderOrdinal =
+            (collationId >> IMPLEMENTATION_PROVIDER_OFFSET) & IMPLEMENTATION_PROVIDER_MASK;
+          if (implementationProviderOrdinal >= ImplementationProvider.values().length) {
+            throw SparkException.internalError("Invalid collation implementation provider");
+          } else {
+            ImplementationProvider implementationProvider = ImplementationProvider.values()[
+              implementationProviderOrdinal];
+            if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+              spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+            } else if (implementationProvider == ImplementationProvider.ICU) {
+              spec = CollationSpecICU.fromCollationId(collationId);
+            } else {
+              throw SparkException.internalError("Cannot instantiate indeterminate collation");
+            }
+            Collation collation = spec.buildCollation();
+            collationMap.put(collationId, collation);
+            return collation;
+          }
+        }
+      }
+
+      protected static SparkException collationInvalidNameException(String collationName) {
+        return new SparkException("COLLATION_INVALID_NAME",
+          SparkException.constructMessageParams(Map.of("collationName", collationName)), null);
+      }
+
+      public static int collationNameToId(String collationName) throws SparkException {
+        String collationNameUpper = collationName.toUpperCase();

Review Comment:
   Which gets us to an interesting problem: we use `toUpperCase`, whose result 
depends on the JVM version, so technically it is not stable over time. Is it 
fair to say that for predefined collations everything is ASCII?
   For user-defined collations, we may want to consider whether to stay as 
generic as we are now.
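
One possible way to pin this down, sketched under the assumption that predefined collation names are ASCII-only: reject non-ASCII names and uppercase with an explicit `Locale.ROOT`. The class and method names below are hypothetical and not part of the PR. The sketch also shows a related sensitivity of the no-argument `toUpperCase()`, which uses the JVM default locale.

```java
import java.util.Locale;

class StableUppercase {
  // True if every UTF-16 code unit of the string is in the ASCII range.
  static boolean isAscii(String s) {
    return s.chars().allMatch(c -> c < 0x80);
  }

  // Uppercases an ASCII-only name independently of the default locale.
  static String upperAscii(String name) {
    if (!isAscii(name)) {
      throw new IllegalArgumentException("non-ASCII collation name: " + name);
    }
    return name.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    // Under Turkish casing rules, 'i' maps to a dotted capital I (U+0130),
    // so the result is not the ASCII string "UNICODE_CI".
    System.out.println("unicode_ci".toUpperCase(turkish).equals("UNICODE_CI"));
    // With Locale.ROOT and ASCII input, the result is the expected name.
    System.out.println(upperAscii("unicode_ci"));
  }
}
```

Restricting the input to ASCII is the part that addresses stability across JVM versions; `Locale.ROOT` only removes the dependence on the default locale.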



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

