[GENERAL] 8.0, UTF8, and CLIENT_ENCODING

Paul Ramsey Thu, 17 May 2007 14:00:31 -0700

I have a small database (PgSQL 8.0, database encoding UTF8) that folks are inserting into via a web form. The form itself is declared ISO-8859-1 and, prior to inserting any data, pg_client_encoding is set to LATIN1.

Most of the high-bit characters are correctly translated from LATIN1 to UTF8. So for e-accent-aigu I see the two-byte UTF8 value in the database.

Sometimes, in their wisdom, people cut'n'paste information out of MS Word and put that in the form. Instead of being mapped to 2-byte UTF8 high-bit equivalents, they are going into the database directly as one-byte values > 127. That is, as illegal UTF8 values.
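
To make the distinction concrete, here is a small Python 3 illustration (not PostgreSQL code) of what a correctly converted LATIN1 byte looks like in UTF8, and why a bare high-bit byte is not legal UTF8. The helper name is_valid_utf8 is made up for this sketch, and the byte 0x92 is only an assumed example of what a Word paste might produce; the actual offending bytes in the database may differ.

```python
# Illustration only: compare a properly converted LATIN1 character with a
# bare high-bit byte. is_valid_utf8 is a hypothetical helper for this sketch.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# e-accent-aigu is the single byte 0xE9 in LATIN1 ...
e_acute_latin1 = b"\xe9"
# ... and the two-byte sequence 0xC3 0xA9 once converted to UTF-8.
e_acute_utf8 = e_acute_latin1.decode("latin-1").encode("utf-8")

print(e_acute_utf8)                 # b'\xc3\xa9'
print(is_valid_utf8(e_acute_utf8))  # True

# Assumed example: a lone byte > 127 (0x92 here) stored as-is is not UTF-8.
print(is_valid_utf8(b"caf\x92"))    # False
```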

When I try to dump'n'restore this database into PgSQL 8.2, my data can't make the transit.

Firstly, is this "kinda sorta" encoding handling expected in 8.0, or did I do something wrong?

Secondly, anyone know any useful tools to pipe a stream through to strip out illegal UTF8 bytes, so I can pipe my dump through that rather than hand editing it?
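
For reference, the kind of filter I have in mind could be sketched roughly as follows, assuming Python 3 is available. The function name strip_invalid_utf8, the --filter flag, and the script name used below are all made up for illustration. It uses the standard library's incremental UTF-8 decoder with errors="ignore", so multi-byte characters split across read boundaries are kept while undecodable bytes are silently dropped. Note that dropping bytes loses the original characters; it does not recover them.

```python
# Sketch of a stdin-to-stdout filter that drops bytes which are not valid
# UTF-8. All names here are hypothetical, chosen for this example.
import codecs
import sys

def strip_invalid_utf8(chunks):
    """Yield the input byte chunks re-encoded as UTF-8, minus invalid bytes."""
    # The incremental decoder buffers incomplete multi-byte sequences
    # between chunks, so a character split across two reads survives.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")
    # Flush: a truncated sequence at end of input is dropped as well.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")

def main():
    src = sys.stdin.buffer
    out = sys.stdout.buffer
    # Read fixed-size blocks until EOF (read() returns b"" at end of input).
    for piece in strip_invalid_utf8(iter(lambda: src.read(65536), b"")):
        out.write(piece)

# Only act as a filter when explicitly asked to, via the made-up --filter flag.
if __name__ == "__main__" and sys.argv[1:] == ["--filter"]:
    main()
```

Saved under a hypothetical name such as strip_utf8.py, it would sit in the pipeline like "pg_dump mydb | python3 strip_utf8.py --filter > clean.sql", where mydb stands in for the real database name. Where iconv is installed, "iconv -c -f UTF-8 -t UTF-8" should do much the same job, since -c omits characters that cannot be converted.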


Thanks,

Paul

--

  Paul Ramsey
  Refractions Research
  http://www.refractions.net
  [EMAIL PROTECTED]
  Phone: 250-383-3022
  Cell: 250-885-0632
