[Please CC any replies, thanks] This patch is the beginnings of support for COLLATE. I need to do some other work for a few days so I'm posting here to get some initial reviews. Various parts are marked [done] and [not done].
The steps involved are: - Add COLLATE to grammer as part of expression tree. New CollateClause node stores the relevent facts. [done] - parse_expr goes through tree and determines the appropriate COLLATE state for each node as per SQL spec. [done] Note: PostgreSQL doesn't really have a way to identify text-like types and there's no real need to anyway. The implementation allows any node to have a collate state. This can be used for other things, see below. - CREATE COLLATE statement [not done, currently added on the fly] - Two new datatypes, pg_locale_t and pg_localedata_t. The former represents a locale (and will eventually have an OID). The latter is an anonymous cookie that stores locale specific information and is passed to functions that need it. It may be that once done, this latter type will vanish again. [done] - Several utility functions in the new file pg_xlocale.c for use by the rest of the system. [done] - Add boolean column 'proislocalized' to pg_proc which indicates if the output of this function is affected by the LOCALE. eg textcat doesn't care but textle does. [done] This is for: a) So the parser can complain about functions that look at the locale/collate order but it's not clear from the arguments (state None per SQL spec) and it's not specified explicitly. [not done] b) So when a column or function is indexed, the index code knows if the locale is relevent to sorting order. This is particularly interesting for btree indexes on character strings. [not done] Currently I've marked 58 functions as being locale sensetive, but that list will need careful going over. To some extent it can be checked automatically by examining which backend functions use the new PG_GETLOCALE() macro. - Check for correct encoding in loaded locales [not done] - Check the partial indexes do the right thing when matching expressions. [not done] - Docs, regression, etc... - make check: right now I'm getting some regressions in the rules and plpgsql, very odd... Possibly due to the fact that rules get collate nodes with locales that don't persist across invokations. Goals: - This setup extends the SQL spec a bit, in the sense that COLLATE can be attached to anything. It is my intention to allow functions such as to_char() and to_timestamp() to be localized. eg: test=# select cash_out('1.00'::money collate 'nl_NL'), cash_out('1.00'::money collate 'en_AU'); cash_out | cash_out ----------+---------- EUR1,00 | $1.00 (1 row) - Should LOCALE be created as a synonym for COLLATE? It reads more naturally. - Currently LC_COLLATE is fixed at initdb and LC_NUMERIC and LC_CURRENCY can be altered. The idea is that eventually even LC_COLLATE can be altered anytime (from the users point view anyway), as any objects that care will store the collate order they want and can execute functions as appropriate. - Eventually once the transition to full locale support is complete, change the backend so it always runs under locale C and only functions on userdata are affected by locales. This means that unquoted identifiers when converted to lowercase will be lowered as per ASCII rules. Judging by  this is what people want, but feedback would be nice.  http://firstname.lastname@example.org/msg39742.html Hence, right now, names compare using normal strcmp, but text, varchar, etc use strcoll. - The patch as it currently stands won't compile at all without system xlocale support. The goal is to provide some level of backward compatability where the COLLATE clause can be used, but only to affect functions like to_char(). Changing COLLATE order would be forbidden (ie just like now). - Eventually, once the parts relevent to locales are sufficiently abstracted, look into ICU to plug it in. Unfortunatly, it's a completely different model (all utf-16) so that's another phase altogether. - The default type output functions should never be locale specific. This is to avoid issues with pgdump and frontends. Create a function 'localize(anyelement)' to "do the obvious" to force it to happen. Download: Compressed 57K, uncompressed >500K but that's due to rewriting the whole of pg_proc. The important code is not so big. Against todays CVS. http://svana.org/kleptog/pgsql/collate2.patch.gz Examples: test=# SELECT text('a') < text('B') COLLATE 'C', text('a') < text('B') COLLATE 'en_US.UTF-8'; ?column? | ?column? ----------+---------- f | t (1 row) test=# SELECT text('A') < text('b') COLLATE 'C', text('A') < text('b') COLLATE 'en_US.UTF-8'; ?column? | ?column? ----------+---------- t | t (1 row) test=# SELECT text('A') COLLATE 'en_US.UTF-8' < text('b') COLLATE 'C'; ERROR: Conflicting COLLATE clauses Have a nice day, -- Martijn van Oosterhout <email@example.com> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Description: PGP signature