[PATCHES] For review: Initial support for COLLATE

Martijn van Oosterhout Wed, 07 Sep 2005 09:13:44 -0700

[Please CC any replies, thanks]

This patch is the beginnings of support for COLLATE. I need to do some
other work for a few days so I'm posting here to get some initial
reviews. Various parts are marked [done] and [not done].


The steps involved are:

- Add COLLATE to grammer as part of expression tree. New CollateClause
  node stores the relevent facts. [done]

- parse_expr goes through tree and determines the appropriate COLLATE
  state for each node as per SQL spec. [done]

  Note: PostgreSQL doesn't really have a way to identify text-like
  types and there's no real need to anyway. The implementation allows
  any node to have a collate state. This can be used for other things,
  see below.

- CREATE COLLATE statement [not done, currently added on the fly]

- Two new datatypes, pg_locale_t and pg_localedata_t. The former
  represents a locale (and will eventually have an OID). The latter is
  an anonymous cookie that stores locale specific information and is
  passed to functions that need it. It may be that once done, this
  latter type will vanish again. [done]

- Several utility functions in the new file pg_xlocale.c for use by the
  rest of the system. [done]

- Add boolean column 'proislocalized' to pg_proc which indicates if the
  output of this function is affected by the LOCALE. eg textcat doesn't
  care but textle does. [done] This is for:

  a) So the parser can complain about functions that look at the
     locale/collate order but it's not clear from the arguments (state
     None per SQL spec) and it's not specified explicitly. [not done]

  b) So when a column or function is indexed, the index code knows if
     the locale is relevent to sorting order. This is particularly
     interesting for btree indexes on character strings. [not done]

  Currently I've marked 58 functions as being locale sensetive, but
  that list will need careful going over. To some extent it can be
  checked automatically by examining which backend functions use the
  new PG_GETLOCALE() macro.

- Check for correct encoding in loaded locales [not done]

- Check the partial indexes do the right thing when matching
  expressions. [not done]

- Docs, regression, etc...

- make check: right now I'm getting some regressions in the rules
  and plpgsql, very odd... Possibly due to the fact that rules get
  collate nodes with locales that don't persist across invokations.

Goals:

- This setup extends the SQL spec a bit, in the sense that COLLATE can
  be attached to anything. It is my intention to allow functions such
  as to_char() and to_timestamp() to be localized. eg:

test=# select cash_out('1.00'::money collate 'nl_NL'), cash_out('1.00'::money 
collate 'en_AU');
 cash_out | cash_out 
----------+----------
 EUR1,00  | $1.00
(1 row)

- Should LOCALE be created as a synonym for COLLATE? It reads more
  naturally.

- Currently LC_COLLATE is fixed at initdb and LC_NUMERIC and
  LC_CURRENCY can be altered. The idea is that eventually even
  LC_COLLATE can be altered anytime (from the users point view anyway),
  as any objects that care will store the collate order they want and
  can execute functions as appropriate.

- Eventually once the transition to full locale support is complete,
  change the backend so it always runs under locale C and only
  functions on userdata are affected by locales. This means that
  unquoted identifiers when converted to lowercase will be lowered as
  per ASCII rules. Judging by [1] this is what people want, but
  feedback would be nice.

[1] http://www.mail-archive.com/[email protected]/msg39742.html

  Hence, right now, names compare using normal strcmp, but text,
  varchar, etc use strcoll.

- The patch as it currently stands won't compile at all without system
  xlocale support. The goal is to provide some level of backward
  compatability where the COLLATE clause can be used, but only to
  affect functions like to_char(). Changing COLLATE order would be
  forbidden (ie just like now).

- Eventually, once the parts relevent to locales are sufficiently
  abstracted, look into ICU to plug it in. Unfortunatly, it's a
  completely different model (all utf-16) so that's another phase
  altogether.

- The default type output functions should never be locale specific.
  This is to avoid issues with pgdump and frontends. Create a function
  'localize(anyelement)' to "do the obvious" to force it to happen.

Download: 
  Compressed 57K, uncompressed >500K but that's due to rewriting the
  whole of pg_proc. The important code is not so big. Against todays
  CVS.

http://svana.org/kleptog/pgsql/collate2.patch.gz

Examples:

test=# SELECT text('a') < text('B') COLLATE 'C', text('a') < text('B') COLLATE 
'en_US.UTF-8';
 ?column? | ?column? 
----------+----------
 f        | t
(1 row)

test=# SELECT text('A') < text('b') COLLATE 'C', text('A') < text('b') COLLATE 
'en_US.UTF-8';
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

test=# SELECT text('A') COLLATE 'en_US.UTF-8' < text('b') COLLATE 'C';
ERROR:  Conflicting COLLATE clauses

Have a nice day,
-- 
Martijn van Oosterhout   <[email protected]>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

pgpDuS7ZrJtY8.pgp
Description: PGP signature

[PATCHES] For review: Initial support for COLLATE

Reply via email to