Re: [HACKERS] ICU integration

Andreas Karlsson Tue, 14 Mar 2017 19:16:04 -0700

On 03/09/2017 10:13 PM, Peter Eisentraut wrote:

- Naming of collations:  Are we happy with the "de%icu" naming?  I might
have come up with that while reviewing the IPv6 zone index patch. ;-)
An alternative might be "de$icu" for more Oracle vibe and avoiding the
need for double quotes in some cases.  (But we have mixed-case names
like "de_AT%icu", so further changes would be necessary to fully get rid
of the need for quoting.)  A more radical alternative would be to
install ICU locales in a different schema and use the search_path, or
even have a separate search path setting for collations only.  Which
leads to ...

I do not like the schema based solution since search_path already givesus enough headaches. As for the naming I am fine with the current scheme.

Would be nice with something we did not have to quote, but I do not likeusing dollar signs since they are already use for other things.

- Selecting default collation provider:  Maybe we want a setting, say in
initdb, to determine which provider's collations get the "good" names?
Maybe not necessary for this release, but something to think about.


This does not seem necessary for the initial release.

- Currently (in this patch), we check a collation's version when it is
first used.  But, say, after pg_upgrade, you might want to check all of
them right away.  What might be a good interface for that?  (Possibly,
we only have to check the ones actually in use, and we have dependency
information for that.)

How about adding a SQL level function for checking this which can becalled by pg_upgrade?


= Review

Here is an initial review. I will try to find some time to do moretesting later this week.

This is a really useful feature given the poor quality of collationsupport libc. Just that ICU versions the encodings is huge, and thelarger range of available collations with high configurability.Additionally being able to use abbreviated keys again would be huge.

ICU also supports writing your own collations and modifying existingones, something which might be possible to expose in the future. Ingeneral ICU offers a lot for users who care about the details of textcollation.


== Functional

- Generally things seem to work fine and as expected.

- I get a test failures in the default test suite due to not having thetr_TR locale installed. I would assume that this would be pretty commonfor hackers.


***************
*** 428,443 ****

  -- to_char
  SET lc_time TO 'tr_TR';
  SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
     to_char
  -------------
!  01 NIS 2010
  (1 row)

  SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
     to_char
  -------------
!  01 NİS 2010
  (1 row)

  -- backwards parsing
--- 428,444 ----

  -- to_char
  SET lc_time TO 'tr_TR';
+ ERROR:  invalid value for parameter "lc_time": "tr_TR"
  SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
     to_char
  -------------
!  01 APR 2010
  (1 row)

  SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr%icu");
     to_char
  -------------
!  01 APR 2010
  (1 row)

  -- backwards parsing

======================================================================

- The code no longer compiles without HAVE_LOCALE_T.

- I do not like how it is not obvious which is the default version ofevery locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not"sv_standard%icu" as one might expect. Is this inherent to ICU orsomething we can work around?

- ICU talks about a new syntax for locale extensions (old:de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this pagehttp://userguide.icu-project.org/collation/api. Is this something weneed to car about? It looks like we currently use the old format, andwhile I personally prefer it I am not sure we should rely on an old syntax.


- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms

== Code review

- For the collprovider is it really correct to say that 'd' is thedefault value as it does in catalogs.sgml?

- I do not know what the policy for formatting the documentation is, butsome of the paragraphs are in need of re-wrapping.

- Add a hint to "ERROR: conflicting or redundant options". The errormessage is pretty unclear.

- I am not a fan of this patch comparing the encoding with a -1 literal.How about adding -1 as a value to the enum? See the example below for aplace where the patch compares with -1.


            ereport(NOTICE,
                (errcode(ERRCODE_DUPLICATE_OBJECT),

- errmsg("collation \"%s\" for encoding \"%s\" alreadyexists, skipping",

-                       collname, pg_encoding_to_char(collencoding))));
+                collencoding == -1
+                ? errmsg("collation \"%s\" already exists, skipping",
+                         collname)

+ : errmsg("collation \"%s\" for encoding \"%s\" alreadyexists, skipping",

+                         collname, pg_encoding_to_char(collencoding))));
            return InvalidOid;

- The patch adds "FIXME" in the below. Is that a left-over fromdevelopment or something which still needs doing?

/*

* Also forbid matching an any-encoding entry. This test of courseis not

     * backed up by the unique index, but it's not a problem since we don't
-    * support adding any-encoding entries after initdb.
+    * support adding any-encoding entries after initdb. FIXME
     */

- Should functions like normalize_locale_name() be renamed to indicatethey relate to libc locales? I am leaning towards doing so but have notlooked closely at the task.


Andreas


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

Reply via email to