Andrew Dunstan wrote:


Jan Urbański wrote:
Andrew Dunstan wrote:


Pavel Stehule wrote:
What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets.

But what conversion rule will you use for the gazillions of Unicode characters?

I honestly do not understand the use case for this at all.

I do. Clients often want their searches to be insensitive to accents and language-specific letters, so that searching for 'łódź' also returns 'lodz'. So the use case is there (in fact, the lack of such a facility made me consider not upgrading a particular client to 8.3...).
Or maybe there's a better way to do it?

Well, my first question would be "Why aren't you using a database encoding that supports to_ascii()?"

Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode characters like (U+2026) HORIZONTAL ELLIPSIS.

I'd like to have
to_ascii(text, [error_handling]) returns text

So no bytea: to_ascii would accept text that's legal in my current database encoding and return text in that encoding. And error_handling would be something like:
- 'error' (the default, throw an error if a character is untranslatable to ASCII)
- 'ignore' (omit untranslatable characters)
- 'transliterate' (do your best to transliterate the character, or leave it as it is if impossible).

Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)

These would have the following properties:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never throws an error
to_ascii(X) is always pure ASCII data when it returns, but sometimes throws an error
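
To make the three modes concrete, here is a rough Python sketch of the semantics I have in mind. It is only an illustration, not PostgreSQL code: the EXTRA table is a hypothetical stand-in for a real transliteration table, and stripping combining marks after NFKD decomposition is just one assumed way of "doing your best".

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode decomposition
# (NFKD leaves the Polish l-with-stroke untouched, for example).
EXTRA = {"ł": "l", "Ł": "L"}


def to_ascii(text, error_handling="error"):
    """Illustrative sketch of the proposed to_ascii(text, [error_handling])."""
    if error_handling not in ("error", "ignore", "transliterate"):
        raise ValueError("unknown error_handling: %r" % (error_handling,))
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII, keep as is
            continue
        mapped = EXTRA.get(ch)
        if mapped is None:
            # Decompose the character and drop the combining marks.
            decomposed = unicodedata.normalize("NFKD", ch)
            stripped = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            if stripped and stripped.isascii():
                mapped = stripped
        if mapped is not None:
            out.append(mapped)
        elif error_handling == "error":
            raise ValueError("character %r is untranslatable to ASCII" % ch)
        elif error_handling == "transliterate":
            out.append(ch)  # leave the character as it is
        # 'ignore': omit the character
    return "".join(out)
```

With this sketch, to_ascii('łódź') gives 'lodz', and the 'ignore' and 'transliterate' calls on the Chinese example behave like the "in reality" lines above.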

It's something like PHP's iconv that can have //TRANSLIT or somesuch (forgive me for giving PHP as an example...). Now I'd love to hear people punch holes in my daydreaming design ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

