Thanks. That makes sense. The default client encoding on the reports database is ISO-8859-8, so I guess when I don't set it using \encoding, it does exactly what you say.
OK, so I'm still looking for a way to convert illegal characters into something that won't collide with my encoding (asterisks or whatever). Thank you, Herouth On 21/07/2012, at 15:36, Craig Ringer wrote: > On 07/21/2012 04:59 PM, Herouth Maoz wrote: >> I am using Postgresql 8.3.14 on our reporting system. There are scripts that >> collect data from many databases across the firm into this database. >> Recently I added tables from a particular database which has encoding UTF-8. >> > > First, I know there have been encoding and UTF-8 handling fixes since 8.3 . > It'd be interesting to see if this still happens on a more recent version. > > You're also missing five bug-fix point-releases in the 8.3 series, as the > latest is 8.3.19 . See: > http://www.postgresql.org/docs/8.3/static/release.html > for fixes you're missing. > > Explanation for what I think is going on below: > >> >> But this puzzles me, because I then took the file >> > ... which was created with a \copy with client encoding set to utf-8, right? >> ran psql and \copy <table> from file >> > > With which client encoding set? UTF-8 or ISO_8859_8? I bet you copied it in > with ISO_8859_1. >> And it worked. I tried it again now, and I can see the row with its Arabic >> content, even though it is not in the database encoding. > It shows up correctly? > > If you \copy a dump in utf-8, then \copy it back in with ISO_8859_8, it > should be mangled. > > If you set your client_encoding to utf_8 ("\encoding utf-8") does it still > show up correctly? I suspect it's wrong in the database and you're just > unmangling it on display. > > It would help if you would actually show the bytes of: > > - The chars in the \copy dump, using `xxd' or similar > - The chars in the database before the copy out and copy in, using a CAST to > `bytea` > - The chars in the database AFTER the copy out and copy in, again with a CAST > to `bytea` > > ... as well as the database encoding, NOT just the client encoding (see > below): >> I checked \encoding. It replies >> ISO_8859_8 > That is the client encoding. > > Try: > > \l+ > > to list databases. You'll see the database encoding there. The same info is > available from: > > SELECT datname, encoding from pg_database WHERE datname = 'mydatabase'; > > > Maybe this demo will help enlighten you. > > regress=# select version(); > version > ------------------------------------------------------------------------------------------------------------- > PostgreSQL 9.1.4 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.7.0 > 20120507 (Red Hat 4.7.0-5), 64-bit (1 row) regress=# \l+ > List of databases > Name | Owner | Encoding | Collate | Ctype | > Access privileges | Size | Tablespace | Description > > --------------------+------------+----------+-------------+-------------+---------------------------+---------+------------+-------------------------------------------- > .... regress | craig | UTF8 | en_US.UTF-8 | > en_US.UTF-8 | | 41 MB | pg_default | regress=# > CREATE TABLE enctest (a serial, x text); CREATE TABLE regress=# -- Some text > randomly pulled off Google News taiwan, since it was convenient and the exact > text doesn't matter regress=# insert into enctest (x) values ('心情已平復很多」。 > 中國網絡電視台報導'); INSERT 0 1 regress=# \encoding UTF8 regress=# \copy ( select x > from enctest ) to enctest.csv Set a 1-byte non-utf encoding, doesn't really > matter which one. Then import the data we dumped as utf-8. regress=# > \encoding latin-1 regress=# \copy enctest(x) from enctest.csv enctest now > contains two rows. One is the correctly encoded original, one is the dumped > and reloaded one. We can't view the whole table while we're in latin-1 > encoding because the correct row won't translate right. regress=# select * > from enctest; ERROR: character 0xe5bf83 of encoding "UTF8" has no equivalent > in "LATIN1" but we *CAN* view the second row we dumped as utf-8 then imported > as latin-1: regress=# regress=# select * from enctest where a = 2; a | > x > ---+--------------------------------------------------------- 2 | > 心情已平復很多」。 中國網絡電視台報導 (1 row) regres At this point you're probably thinking > "WTF!?!". It shows up correctly in my terminal because my terminal is utf-8, > irrespective of the encoding set in psql. Setting a non-utf-8 encoding in > psql via "\encoding" just lies to psql about the encoding of the bytes I > paste in on my terminal. It receives a byte sequence that could be valid > latin-1- though it's actually nonsense garbage, it can't tell the difference. > It trusts me and translates the "latin-1" I sent into utf-8 for storage. When > I ask for it back again and I'm in a latin-1 client encoding, it converts > that utf-8 back into latin-1 - or that's what it thinks it's doing. It's > actually demangling mangled utf-8 so my console can display it. We can > undestand this better if we examine what's actually in the database. > regress=# \encoding utf-8 regress=# select * from enctest; a | > x > > ---+------------------------------------------------------------------------------------------------------------------------------------ > 1 | 心情已平復很多」。 中國網絡電視台報導 2 | > å¿\u0083æ\u0083\u0085已平復å¾\u0088å¤\u009Aã\u0080\u008Dã\u0080\u0082 > ä¸å\u009C\u008B網絡é\u009B»è¦\u0096å\u008F°å ±å°\u008E (2 rows) regress=# > select a, x::bytea from enctest; a | > x > > > ---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > -------------------------------------------------- 1 | > \xe5bf83e68385e5b7b2e5b9b3e5bea9e5be88e5a49ae3808de3808220e4b8ade59c8be7b6b2e7b5a1e99bbbe8a696e58fb0e5a0b1e5b08e > 2 | > \xc3a5c2bfc283c3a6c283c285c3a5c2b7c2b2c3a5c2b9c2b3c3a5c2bec2a9c3a5c2bec288c3a5c2a4c29ac3a3c280c28dc3a3c280c28220c3a4c2b8c2adc3a5c29cc28bc3a7c2b6c2b2c3a7c2b5c2a1c3a9c29bc2b > bc3a8c2a6c296c3a5c28fc2b0c3a5c2a0c2b1c3a5c2b0c28e (2 rows) This should help. > See how the first row shows up correctly when we're in the right client > encoding, but the second one is gibberish? That's what UTF-8 that's been > interpreted as latin-1 and "converted" into utf-8 looks like. It's even more > informative if I start a terminal in latin-1 and then set client_encoding to > latin-1, so I'm not lying to psql about my client encoding. $ > LANG=en_AU.iso55891 LC_ALL=en_AU.iso88591 xterm $ psql regress regress=# > \encoding LATIN-1 regress=# select x from enctest where a = 2; > x > --------------------------------------------------------- å¿æ ã > ä¸å網絡é¦å°å ±å° (1 row)¾©å¾å¤ã Note that latin-1 is a synonym for > iso-8859-1. Now you can see that the data in the DB is actually mangled. It's > just that when you accidentally lie to Pg the same way in the input and > output phases as I've shown, it *looks* ok though the byte sequence in the > database is garbage. Here's yet another way to illustrate it. Let's take the > first char of our input and see what data is produced when we convert its > utf-8 byte using a conversion function from latin-1 into utf-8. regress=# > select '心'::bytea, convert( '心'::bytea, 'latin-1', 'utf-8') from enctest; > bytea | convert ----------+---------------- \xe5bf83 | > \xc3a5c2bfc283 (1 row) Look familiar from the example above? See how the > valid utf-8 sequence in the first col gets converted into garbage in the > second col? Yet we can reverse the incorrect conversion to get valid utf-8 > again: regress=# select convert( BYTEA '\xc3a5c2bfc283', 'utf-8', 'latin-1'); > convert ---------- \xe5bf83 (1 row) ... which is probably what you've > been doing. >> >> What's happening here? Why does the database accept input in the wrong >> encoding and doesn't shout when I then try to select that input? >> > Correct, because legacy 1-byte encodings cannot be verified. There's no way > to say "Yup, this is latin-1" or "Yup, this is ISO-8859-8". > > Don't set client_encoding unless you REALLY know encodings. If you do set it > for \copy, make sure you always \copy in with the same encoding you used for > the \copy out. PostgreSQL cannot protect you from this. > > -- > Craig Ringer > -- חרות מעוז יוניסל פתרונות סלולריים מתקדמים ☎ 03-5181717 שלוחה 742