Andrew Dunstan wrote:
Jan Urbański wrote:
Andrew Dunstan wrote:
Pavel Stehule wrote:
What you have not said is how you propose to convert UTF8 to ASCII.
Currently to_ascii() converts a small number of single byte charsets
to ASCII by folding the chars with high bits set, so what we get is a
pure ASCII result which is safe in any server encoding, as they are
all ASCII supersets.
But what conversion rule will you use for the gazillions of Unicode
characters?
I honestly do not understand the use case for this at all.
I do. Often clients want their searches to be insensitive to accents
and language-specific letters, so that searching for 'łódź' also
returns 'lodz'. So the use case is there (in fact, the lack of such a
facility made me consider not upgrading a particular client to
8.3...).
Or maybe there's a better way to do it?
Well, my first question would be "Why aren't you using a database
encoding that supports to_ascii()?"
Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes
input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode
characters like (U+2026) HORIZONTAL ELLIPSIS.
I'd like to have
to_ascii(text, [error_handling]) returns text
So no bytea, to_ascii would accept text that's legal in my current
database encoding and return text in that encoding. And error_handling
would be something like:
- 'error' (the default, throw an error if a character is untranslatable
to ASCII)
- 'ignore' (omit untranslatable characters)
- 'transliterate' (do your best to transliterate the character, or leave
it as it is if impossible).
Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)
These would have the following properties:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never
throws an error
to_ascii(X) is always pure ASCII data when it returns, but sometimes throws an error
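To pin down the semantics, here is a minimal Python sketch of the three modes. It is not PostgreSQL code and not how to_ascii() works today: fold_char and EXTRA_FOLDS are names made up for illustration, and the folding rule (NFKD decomposition, then dropping combining marks, plus a tiny hand-written table for letters such as 'ł' that have no decomposition) is only an assumption about one way to do it.

```python
import unicodedata

# Hypothetical hand-maintained table for letters that Unicode does not
# decompose into "base letter + combining mark" (NFKD leaves them as-is).
EXTRA_FOLDS = {"ł": "l", "Ł": "L"}


def fold_char(ch):
    """Return an ASCII replacement for ch, or None if none is known."""
    if ord(ch) < 128:
        return ch
    if ch in EXTRA_FOLDS:
        return EXTRA_FOLDS[ch]
    # Decompose, then drop combining marks: 'ó' -> 'o' + U+0301 -> 'o'.
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped and all(ord(c) < 128 for c in stripped):
        return stripped
    return None


def to_ascii(text, error_handling="error"):
    """Sketch of the proposed to_ascii(text, [error_handling])."""
    if error_handling not in ("error", "ignore", "transliterate"):
        raise ValueError("unknown error_handling: %r" % (error_handling,))
    out = []
    for ch in text:
        folded = fold_char(ch)
        if folded is not None:
            out.append(folded)
        elif error_handling == "error":
            raise ValueError("character %r is untranslatable to ASCII" % ch)
        elif error_handling == "transliterate":
            # Best effort failed, so leave the character as it is.
            out.append(ch)
        # 'ignore': drop the character silently.
    return "".join(out)
```

With this sketch, to_ascii('łódź') gives 'lodz', the 'ignore' mode drops 中國 entirely, and the 'transliterate' mode passes 中國 through unchanged, which matches the "in reality" example above.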
It's something like PHP's iconv that can have //TRANSLIT or somesuch
(forgive me for giving PHP as an example...). Now I'd love to hear
people punch holes in my daydreaming design ;)
Cheers,
Jan
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin