Re: [HACKERS] ICU integration

Petr Jelinek Wed, 21 Sep 2016 20:19:40 -0700

On 31/08/16 04:46, Peter Eisentraut wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.
>


Hi, first pass review of the patch (somewhat high level at this point).

First, there was some discussion about ICU versioning and collation
versioning and support for multiple versions of ICU at the same time. My
opinion on this is that the approach you have chosen is good, at least
for first iteration of the feature. We might want to support multiple
versions of the library in the future (I am not all that convinced of
that though), but we definitely don't at the moment. It is important to
have version checking against the data directory which you have done.

> 
> What I have done is extend collation objects with a collprovider column
> that tells whether the collation is using POSIX (appropriate name?) or
> ICU facilities.  The pg_locale_t type is changed to a struct that
> contains the provider-specific locale handles.  Users of locale
> information are changed to look into that struct for the appropriate
> handle to use.
> 

This sounds reasonable. I would prefer the POSIX to be named something
like SYSTEM though (think windows).

> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I wonder if we should have both %icu and %posix unconditionally and then
have option to pick one of them for the list of collations without the
suffix (defaulting to posix for now but maybe not in the future).

> 
> That all works well enough for named collations and for sorting.  The
> thread about the FreeBSD ICU patch discusses some details of how to best
> use the ICU APIs to do various aspects of the sorting, so I didn't focus
> on that too much.

That's something for follow-up patches IMHO.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.  

Can't we use the regular expression api provided by ICU, or is that too
incompatible?

> Another problem, which was also previously
> discussed is that ICU does case folding in a locale-agnostic manner, so
> it does not consider things such as the Turkish special cases.  This is
> per Unicode standard modulo weasel wording, but it breaks existing tests
> at least.
> 

I am quite sure it will break existing usecases as well. Don't know if
that's an issue if we keep multiple locale providers though. It's
expected that you get different behavior from them.

> 
> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

I think we'll need to separate the process locale and the locale used
for string operations.

> 
> We could have a datcollation column that by OID references a collation
> defined inside the database.  With a background worker, we can log into
> the database as it is being created and make adjustments, including
> defining or adjusting collation definitions.  This would open up
> interesting new possibilities.

I don't really follow this but it sounds icky.

> 
> What is a way to go forward here?  What's a minimal useful feature that
> is future-proof?  Just allow named collations referencing ICU for now?

I think that's minimal useful feature indeed, it will move us forward,
especially on systems where posix locales are not that well implemented
but will not be world changer just yet.


> Is throwing out POSIX locales even for the process locale reasonable?
> 

I don't think we can really do that, we'll need some kind of mapping
table between system locales and ICU locales (IIRC we build something
like that on windows as well). Maybe that mapping can be used for the
datctype to process locale conversion (datctype would then be provider
specific).

We also need to keep posix locale support anyway otherwise we'd break
pg_upgrade I think (at leas to the point that pg_upgrade where you have
to reindex whole database is much less feasible).

> Oh, that case folding code in formatting.c needs some refactoring.
> There are so many ifdefs there and it's repeated almost identically
> three times, it's crazy to work in that.

Yuck, yes it wasn't pretty before and it's quite unreadable now, I think
this patch will have to do something about that.

The message about collate provider version change could use some
improvements though, ie we could try to find affected indexes, also
there might be check constraints affected, etc. If nothing else it
deserves explanation in the docs.

The icu conversion functions could use some docs.

I wonder if some of the pieces of the code with following pattern:
    if (mylocale->provider == COLLPROVIDER_ICU)
        do lot's of icu stuff
    else
        call posix function
should move the icu code to separate functions, it would certainly
improve overall readability.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

Reply via email to