Aaron Stone wrote:
On Thu, 2005-11-10 at 13:26 -0800, Robert Fleming wrote:
This has been discussed a couple of times before. I've tried to summarize
it here:
http://www.dbmail.org/dokuwiki/doku.php?id=unicode_postgresql_database
Bug 218 had to do with problems with Unicode encoding prior to
PostgreSQL 8.1. But everything else sounds like it has to do with the
nature of proper encodings in general. Is there still a
version-dependent component to this issue?
I'm not sure that bug 218 was related to the Unicode fixes in PostgreSQL
8.1 -- attempting to store an ISO 8859-1 string (with octets > 127) in a
UNICODE database would fail with all recent versions of PostgreSQL. At
the same time, I can't make out what exactly happened to the bug
reporter. His message was "Content-Transfer-Encoding: 7bit", so it
/should/ not have had any octets outside the US-ASCII range and would
therefore be storable in a UNICODE db.
It seems to me that these are all the same problem: putting an invalid
UTF-8 sequence in a "text" field in a database with UNICODE encoding.
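To make the failure concrete, here is a rough Python sketch (not dbmail or PostgreSQL code; the helper names is_7bit and is_valid_utf8 and the sample strings are made up for illustration). It only shows why a pure 7bit message is always acceptable to a UNICODE database while an ISO 8859-1 string with a high octet generally is not:

```python
def is_7bit(octets: bytes) -> bool:
    """True if every octet is in the US-ASCII range (0..127)."""
    return all(b < 128 for b in octets)


def is_valid_utf8(octets: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample data, not taken from bug 218.
ascii_subject = b"Cafe"                          # what a 7bit message may contain
latin1_subject = "Caf\u00e9".encode("iso-8859-1")  # b'Caf\xe9', one octet > 127
utf8_subject = "Caf\u00e9".encode("utf-8")        # b'Caf\xc3\xa9'

for name, data in [("ascii", ascii_subject),
                   ("latin1", latin1_subject),
                   ("utf8", utf8_subject)]:
    print(name, is_7bit(data), is_valid_utf8(data))
```

US-ASCII is a strict subset of UTF-8, which is why the 7bit case can never trip the check; the lone 0xE9 octet starts a multi-byte sequence that is never completed, so it is rejected.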
IMHO, the database should be asked to just store raw octets as they're
received from the Internet (as you mentioned, there are no guarantees
that received messages will be free of encoding anomalies). Asking
the database to do automatic encoding conversions via the "client
encoding" mechanism is just going to cause problems: the conversions
would need to round-trip perfectly, e.g. to preserve digital
signatures.
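As a sketch of the round-tripping concern (again plain Python, not dbmail code): the message below is invented, SHA-256 stands in for whatever a signature scheme would hash, and the errors="replace" step only simulates a conversion layer that is not byte-exact. PostgreSQL itself would reject the input rather than substitute characters, so treat this as an illustration of the principle only:

```python
import hashlib


def digest(octets: bytes) -> str:
    """Hash of the exact octets, as a signature check would see them."""
    return hashlib.sha256(octets).hexdigest()


# Hypothetical message as received from the network, with an encoding anomaly:
# a raw ISO 8859-1 octet (0xE9) in a header.
received = b"Subject: Caf\xe9\r\n\r\nbody\r\n"

# Option 1: store and return the octets untouched (e.g. a binary column).
stored_raw = bytes(received)

# Option 2: simulate a lossy text conversion on the way in and out.
# ASSUMPTION: errors="replace" is a stand-in for any non-exact conversion.
stored_converted = received.decode("utf-8", errors="replace").encode("utf-8")

print(digest(stored_raw) == digest(received))        # octets preserved
print(digest(stored_converted) == digest(received))  # octets altered
```

Only the first option returns exactly what was received, so only it leaves anything that was computed over the original octets verifiable.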
I would say that in general, this dbmail issue does not depend on the
PostgreSQL version, because no recent PostgreSQL version would have
allowed illegal UTF-8 sequences in UNICODE databases.
Robert