Re: [HACKERS] proposal: UTF8 to_ascii function

Jan Urbański Mon, 11 Aug 2008 07:13:55 -0700

Andrew Dunstan wrote:

Jan Urbański wrote:
Andrew Dunstan wrote:
Pavel Stehule wrote:
What you have not said is how you propose to convert UTF8 to ASCII.
Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets.
But what conversion rule will you use for the gazillions of Unicode characters?
I honestly do not understand the use case for this at all.
I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...).
Or maybe there's a better way to do it?
Well, my first question would be "Why aren't you using a database encoding that supports to_ascii()?"

Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode characters like (U+2026) HORIZONTAL ELLIPSIS.


I'd like to have
to_ascii(text, [error_handling]) returns text

So no bytea, to_ascii would accept text that's legal in my current database encoding and return text in that encoding. And error_handling would be something like:
- 'error' (the default, throw an error if a character is untranslatable to ASCII)
- 'ignore' (omit untranslatable characters)
- 'transliterate' (do your best to transliterate the character, or leave it as it is if impossible).


Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '

to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)


These would have the property, that:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error

to_ascii(X, 'transliterate') is sometimes non-ASCII data and never throws an error

to_ascii(X) is always pure ASCII data and sometimes throws an error
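
As a rough illustration of the three error_handling modes described above, here is a Python sketch. It is not the proposed server-side implementation: the name to_ascii_sketch, the EXTRA_FOLDS table and the choice of Unicode NFKD decomposition as the folding rule are all assumptions made for the example. Letters such as 'ł' have no Unicode decomposition, so they need a hand-written entry.

```python
import unicodedata

# Hypothetical hand-written folds for letters that NFKD does not decompose.
EXTRA_FOLDS = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"}


def fold_char(ch):
    """Return an ASCII replacement for ch, or None if none is known."""
    if ord(ch) < 128:
        return ch
    if ch in EXTRA_FOLDS:
        return EXTRA_FOLDS[ch]
    # Decompose (e.g. 'ó' -> 'o' + U+0301) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base and all(ord(c) < 128 for c in base):
        return base
    return None


def to_ascii_sketch(text, error_handling="error"):
    """Sketch of to_ascii(text, [error_handling]) with the three modes."""
    if error_handling not in ("error", "ignore", "transliterate"):
        raise ValueError("unknown error_handling: %r" % (error_handling,))
    out = []
    for ch in text:
        folded = fold_char(ch)
        if folded is not None:
            out.append(folded)
        elif error_handling == "error":
            raise ValueError("character %r is untranslatable to ASCII" % ch)
        elif error_handling == "transliterate":
            # Best effort failed, so leave the character as it is.
            out.append(ch)
        # 'ignore': omit the character.
    return "".join(out)
```

With this sketch, to_ascii_sketch('łódź') gives 'lodz', the 'ignore' mode drops 中國 entirely, and the 'transliterate' mode passes 中國 through unchanged (the "in reality" case, since no romanization table is included).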

It's something like PHP's iconv that can have //TRANSLIT or somesuch (forgive me for giving PHP as an example...). Now I'd love to hear people punch holes in my daydreaming design ;)


Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
