Re: ICU locale validation / canonicalization

Jeff Davis Thu, 09 Feb 2023 14:10:01 -0800

On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how to
> make that happen are any good or not. But I hope they are, because the
> current situation is pessimal.


It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail; warn and continue with the
user-entered representation; or turn off the strictness checking, come
up with some BCP 47 tag, and see if it resolves to the same collator.

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than
"en-US-u-ks-level1" (the equivalent language tag). And the format is
specified[1], even though it's not an independent standard. But I think
the benefits of better validation, an independent standard, and the fact
that we're already favoring BCP 47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1]
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

#include <stdbool.h>
#include <stdio.h>
#include <unicode/ucol.h>

#define CAPACITY 1024

int main(int argc, char *argv[])
{
  UErrorCode	 status;
  UCollator	*collator;
  const char	*requested_locale = NULL;
  const char	*valid_locale;
  const char	*actual_locale;
  char		 canonical[CAPACITY] = {0};
  char		 variant[CAPACITY] = {0};
  char		 basename[CAPACITY] = {0};
  char		 getname[CAPACITY] = {0};
  char		 langtag[CAPACITY] = {0};
  char		 langtag_s[CAPACITY] = {0};

  if (argc > 1)
    {
      requested_locale = argv[1];
      status = U_ZERO_ERROR;
      uloc_canonicalize(requested_locale, canonical, CAPACITY, &status);
      if (U_FAILURE(status))
	printf("canonicalize error: %s\n", u_errorName(status));
      status = U_ZERO_ERROR;
      uloc_getBaseName(requested_locale, basename, CAPACITY, &status);
      if (U_FAILURE(status))
	printf("basename error: %s\n", u_errorName(status));
      status = U_ZERO_ERROR;
      uloc_getName(requested_locale, getname, CAPACITY, &status);
      if (U_FAILURE(status))
	printf("getname error: %s\n", u_errorName(status));
      status = U_ZERO_ERROR;
      uloc_getVariant(requested_locale, variant, CAPACITY, &status);
      if (U_FAILURE(status))
	printf("variant error: %s\n", u_errorName(status));
      status = U_ZERO_ERROR;
      uloc_toLanguageTag(requested_locale, langtag, CAPACITY, false, &status);
      if (U_FAILURE(status))
	printf("langtag error: %s\n", u_errorName(status));
      /* Reset status so an earlier failure does not make this call a no-op. */
      status = U_ZERO_ERROR;
      uloc_toLanguageTag(requested_locale, langtag_s, CAPACITY, true, &status);
      if (U_FAILURE(status))
	printf("langtag strict error: %s\n", u_errorName(status));
    }
  /* Only argv[1] is used; warn if extra arguments were given. */
  if (argc > 2)
    fprintf(stderr, "too many arguments\n");
  
  printf("canonicalize: %s\n", canonical);
  printf("langtag       : %s\n", langtag);
  printf("langtag strict: %s\n", langtag_s);
  printf("variant: %s\n", variant);
  printf("getBaseName: %s\n", basename);
  printf("getName: %s\n", getname);

  /* A NULL requested_locale makes ucol_open() use the default locale. */
  status = U_ZERO_ERROR;
  collator = ucol_open(requested_locale, &status);
  if (U_FAILURE(status))
    {
      printf("ucol_open error: %s\n", u_errorName(status));
      return 1;
    }
  valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status);
  actual_locale = ucol_getLocaleByType(collator, ULOC_ACTUAL_LOCALE, &status);
  if (U_FAILURE(status))
    printf("getLocaleByType error: %s\n", u_errorName(status));
  else
    {
      printf("valid locale: %s\n", valid_locale);
      printf("actual locale: %s\n", actual_locale);
    }
  ucol_close(collator);
  return 0;
}
