[HACKERS] Unicode Normalization

David E. Wheeler Wed, 23 Sep 2009 11:08:33 -0700

Hackers,

I just had a discussion on IRC about Unicode normalization in PostgreSQL. Apparently there is currently no support for it. Andrew Gierth points out that supporting it is part of the SQL spec, though:

RhodiumToad: e.g. NORMALIZE(foo,NFC,len)
justatheory: Oh, just a function then, really.
RhodiumToad: where the normal form can be any of NFC, NFD, NFKC, NFKD
RhodiumToad: except that the normal form is an identifier, not a string
RhodiumToad: also the normal form and length are optional
RhodiumToad: so NORMALIZE(foo) is equivalent to NORMALIZE(foo,NFC)
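
To make the argument handling Andrew describes concrete, here is a minimal C sketch of resolving the optional normal-form identifier, falling back to NFC when it is omitted. The names `normal_form`, `ident_equals`, and `resolve_form` are hypothetical and belong to neither PostgreSQL nor utf8proc. Matching the identifier case-insensitively is an assumption, based on how unquoted SQL identifiers usually behave.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical enum for the four normal forms named in the discussion. */
typedef enum {
    FORM_NFC,
    FORM_NFD,
    FORM_NFKC,
    FORM_NFKD,
    FORM_INVALID
} normal_form;

/* Case-insensitive comparison of two ASCII identifiers (assumed behavior). */
int ident_equals(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char) *a) != toupper((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == '\0' && *b == '\0';
}

/*
 * Resolve the optional normal-form argument.
 * A NULL identifier means the argument was omitted, so NORMALIZE(foo)
 * behaves like NORMALIZE(foo, NFC).
 */
normal_form resolve_form(const char *ident)
{
    static const struct {
        const char *name;
        normal_form form;
    } forms[] = {
        {"NFC", FORM_NFC},
        {"NFD", FORM_NFD},
        {"NFKC", FORM_NFKC},
        {"NFKD", FORM_NFKD},
    };
    size_t i;

    if (ident == NULL)
        return FORM_NFC;

    for (i = 0; i < sizeof forms / sizeof forms[0]; i++) {
        if (ident_equals(ident, forms[i].name))
            return forms[i].form;
    }
    return FORM_INVALID;
}
```

With this sketch, `resolve_form(NULL)` and `resolve_form("NFC")` both yield `FORM_NFC`, and an unknown name such as `"NFX"` yields `FORM_INVALID`, which a real implementation would presumably report as an error.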

I looked around and found the Public Software Group's utf8proc project, which even includes some PostgreSQL support (not, alas, for normalization). It has an MIT-licensed C library that offers these functions:

uint8_t *utf8proc_NFD(uint8_t *str)
Returns a pointer to newly allocated memory of an NFD-normalized version of the null-terminated string str.
uint8_t *utf8proc_NFC(uint8_t *str)
Returns a pointer to newly allocated memory of an NFC-normalized version of the null-terminated string str.
uint8_t *utf8proc_NFKD(uint8_t *str)
Returns a pointer to newly allocated memory of an NFKD-normalized version of the null-terminated string str.
uint8_t *utf8proc_NFKC(uint8_t *str)
Returns a pointer to newly allocated memory of an NFKC-normalized version of the null-terminated string str.

Anyone got any interest in porting these functions to PostgreSQL? I guess the parser would need to be updated to support the use of identifiers in the NORMALIZE() function, but otherwise it should be a fairly straightforward port for an experienced C coder, no?


Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
