On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how to
> make that happen are any good or not. But I hope they are, because the
> current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are already
using it for the import of initial collations, and it's a standard, and
there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail, warn and continue with the
user-entered representation, or turn off the strictness checking and
come up with some BCP 47 tag and see if it resolves to the same
collator.

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than
"en-US-u-ks-level1" (the equivalent language tag). And the format is
specified[1], even though it's not an independent standard. But I think
the benefits of better validation, an independent standard, and the fact
that we're already favoring BCP 47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to experiment
(not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list conversations
with Peter Eisentraut to try to learn more (I believe he is aligned with
my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1] https://unicode-org.github.io/icu/userguide/locale/#canonicalization

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

#include <stdbool.h>
#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uloc.h>

#define CAPACITY 1024

int main(int argc, char *argv[])
{
    UErrorCode status;
    UCollator *collator;
    const char *requested_locale = NULL;
    const char *valid_locale;
    const char *actual_locale;
    char canonical[CAPACITY] = {0};
    char variant[CAPACITY] = {0};
    char basename[CAPACITY] = {0};
    char getname[CAPACITY] = {0};
    char langtag[CAPACITY] = {0};
    char langtag_s[CAPACITY] = {0};

    /* Check this first; after "argc > 1" it could never be reached. */
    if (argc > 2) {
        fprintf(stderr, "too many arguments\n");
        return 1;
    }

    if (argc > 1) {
        requested_locale = argv[1];

        /* Reset status before every call: ICU functions return early if it already holds an error. */
        status = U_ZERO_ERROR;
        uloc_canonicalize(requested_locale, canonical, CAPACITY, &status);
        if (U_FAILURE(status))
            printf("canonicalize error: %s\n", u_errorName(status));

        status = U_ZERO_ERROR;
        uloc_getBaseName(requested_locale, basename, CAPACITY, &status);
        if (U_FAILURE(status))
            printf("basename error: %s\n", u_errorName(status));

        status = U_ZERO_ERROR;
        uloc_getName(requested_locale, getname, CAPACITY, &status);
        if (U_FAILURE(status))
            printf("getname error: %s\n", u_errorName(status));

        status = U_ZERO_ERROR;
        uloc_getVariant(requested_locale, variant, CAPACITY, &status);
        if (U_FAILURE(status))
            printf("variant error: %s\n", u_errorName(status));

        /* Non-strict conversion to a BCP 47 language tag. */
        status = U_ZERO_ERROR;
        uloc_toLanguageTag(requested_locale, langtag, CAPACITY, false, &status);
        if (U_FAILURE(status))
            printf("langtag error: %s\n", u_errorName(status));

        /* Strict conversion; the status reset here was missing originally. */
        status = U_ZERO_ERROR;
        uloc_toLanguageTag(requested_locale, langtag_s, CAPACITY, true, &status);
        if (U_FAILURE(status))
            printf("langtag strict error: %s\n", u_errorName(status));
    }

    printf("canonicalize: %s\n", canonical);
    printf("langtag : %s\n", langtag);
    printf("langtag strict: %s\n", langtag_s);
    printf("variant: %s\n", variant);
    printf("getBaseName: %s\n", basename);
    printf("getName: %s\n", getname);

    /* With no argument, requested_locale is NULL and ICU opens its default locale. */
    status = U_ZERO_ERROR;
    collator = ucol_open(requested_locale, &status);
    if (U_FAILURE(status)) {
        printf("ucol_open error: %s\n", u_errorName(status));
        return 1;
    }

    valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status);
    actual_locale = ucol_getLocaleByType(collator, ULOC_ACTUAL_LOCALE, &status);
    if (U_FAILURE(status))
        printf("getLocaleByType error: %s\n", u_errorName(status));

    printf("valid locale: %s\n", valid_locale ? valid_locale : "(null)");
    printf("actual locale: %s\n", actual_locale ? actual_locale : "(null)");

    ucol_close(collator);
    return 0;
}