Aaron Stone wrote:

On Thu, 2005-11-10 at 13:26 -0800, Robert Fleming wrote:

This has been discussed a couple times before. I've tried to summarize it here:

http://www.dbmail.org/dokuwiki/doku.php?id=unicode_postgresql_database

Bug 218 had to do with problems with Unicode encoding prior to
PostgreSQL 8.1. But everything else sounds like it's to do with the very
nature of proper encodings in general. Is there still a version
dependent component to this issue?

I'm not sure that bug 218 was related to the Unicode fixes in PostgreSQL 8.1 -- attempting to store an ISO 8859-1 string (with octets > 127) in a UNICODE database would fail with all recent versions of PostgreSQL. At the same time, I can't make out exactly what happened to the bug reporter. His message was "Content-Transfer-Encoding: 7bit", so it /should/ not have had any octets outside the US-ASCII range, and would therefore be storable in a UNICODE db.
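To illustrate the distinction above, here is a minimal Python sketch of the kind of check a UNICODE database effectively performs on incoming text. The helper name is_valid_utf8 and the sample strings are made up for illustration; this is not dbmail or PostgreSQL code.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample data: a pure US-ASCII body (what a "7bit" message
# should contain) and an ISO 8859-1 body with one octet above 127.
ascii_body = "cafe".encode("us-ascii")       # b'cafe'
latin1_body = "café".encode("iso-8859-1")    # b'caf\xe9'

print(is_valid_utf8(ascii_body))    # every US-ASCII octet is valid UTF-8
print(is_valid_utf8(latin1_body))   # a lone 0xE9 is a truncated UTF-8 sequence
```

Note that a few ISO 8859-1 strings do happen to be valid UTF-8 by coincidence, so a check like this catches the typical case rather than proving which encoding was intended.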

It seems to me that these are all the same problem: putting an invalid UTF-8 sequence in a "text" field in a database with UNICODE encoding. IMHO, the database should be asked to just store raw octets as they're received from the Internet (as you mentioned, there are no guarantees that received messages will not have encoding anomalies). Asking the database to do automatic encoding conversions via the "client encoding" mechanism is therefore just going to cause problems (it would need to guarantee perfect round-tripping of conversions, e.g. to preserve digital signatures).
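The round-tripping concern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the message bytes are invented, a SHA-256 digest stands in for a digital signature, and Python's codecs stand in for whatever conversion the database's "client encoding" mechanism would apply.

```python
import hashlib


def digest(octets: bytes) -> str:
    """Hex SHA-256 of the octets; a stand-in for a signature check."""
    return hashlib.sha256(octets).hexdigest()


# Hypothetical message as received: ISO 8859-1 octets, one above 127.
received = b"Subject: caf\xe9\r\n\r\nbody\r\n"

# Option 1: store the raw octets untouched.
raw_stored = bytes(received)

# Option 2: treat the octets as ISO 8859-1 and convert to UTF-8 for storage.
converted = received.decode("iso-8859-1").encode("utf-8")

# Option 3: force the octets into UTF-8, replacing whatever is invalid.
lossy = received.decode("utf-8", errors="replace").encode("utf-8")

print(digest(raw_stored) == digest(received))   # raw storage preserves the digest
print(digest(converted) == digest(received))    # stored octets differ from the wire
print(digest(lossy) == digest(received))        # original octet is lost entirely

# Option 2 is recoverable only if the source encoding is known and the
# reverse conversion is applied exactly on the way out.
print(converted.decode("utf-8").encode("iso-8859-1") == received)
```

Only raw storage keeps the octets identical without any further bookkeeping; the converting options depend on knowing, and correctly reversing, the original encoding.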

I would say that in general, this dbmail issue does not depend on the PostgreSQL version, because no recent PostgreSQL version would have allowed illegal UTF-8 sequences in UNICODE databases.

Robert
