Re: [HACKERS] invalidly encoded strings

db Mon, 10 Sep 2007 23:37:18 -0700

>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).


And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produce a sequence of bytes that can't be interpreted as a
string in the server encoding then it's broken. Imho convert() should
return a bytea value. If we hade good encoding/charset support we could do
better, but we can't today.

The above example would work fine if convert() returned a bytea. In the C
locale the string would be compared byte for byte and that's what you get
with bytea values as well.

Strings are not sequences of bytes that can be interpreted in different
ways. That's what bytea values are. Strings are in a specific encoding
always, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

Re: [HACKERS] invalidly encoded strings

Reply via email to