On 20.07.2013 at 17:25, Marcin Mirosław <[email protected]> wrote:

> On 2013-07-20 14:17, Axel Rau wrote:
>> Recording of UTF-8 characters from headers in the main log and the
>> PostgreSQL DB via lookup usually works flawlessly.
>>
>> Occasionally PostgreSQL complains during INSERT of header items or main log
>> events (our log host uses PostgreSQL as backend) about an invalid byte
>> sequence, like here:
>>
>> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn"
>> statement skipped: condition test deferred: PGSQL: query failed: ERROR:
>> invalid byte sequence for encoding "UTF8": 0xfc
>>
>> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid
>> byte sequence for encoding "UTF8": 0xfc
>> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event):
>> |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR)
>> (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X',
>> 'Staatstheater Nürnberg <[email protected]>', 'Newsletter
>> Staatstheater Nürnberg', 'none', 'N/A'))
>>
>> Does this come from bad encoding of the original mail headers?
>> Is there an easy solution to skip bad characters before sending them to the
>> DB?
>>
>> In lookups/pgsql.c:258 I see:
>> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>>
>> but I think it's not related.
>
> Hi Axel!
> I suspect it is related. If you insert text into PostgreSQL, you
> need to know which encoding the text uses. If you know the
> inserted text is UTF-8, you should set "client_encoding" to UTF-8.
> But in emails you never know which encoding will be used. In theory,
> only basic ASCII characters should be used.
> You can:
> a) reject mail with non-ASCII chars in the Subject.
> b) encode the Subject using e.g. base64, then insert it into the database
> c) guess which encoding was used in the Subject, then set the
> "client_encoding" parameter accordingly
> d) use "C" collation for the given database/table in PostgreSQL - it allows
> you to insert any characters into the table.
> But you will lose the ability
> to get tuples in your preferred charset. (E.g. you can keep text as UTF-8
> in the database, but when you set "client_encoding" to e.g. 8859-2 you will
> get the text in 8859-2. With "C" collation, pgsql doesn't convert to e.g.
> ISO 8859-2.)

As exim works with UTF-8 strings, my naive assumption was that a header like

    Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=

(RFC 2047) will be converted to UTF-8 by exim before I access it via
$h_Subject: . Looking at the complexity of expand.c, this seems to be the
case. Can anybody confirm this?
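To make the expected conversion concrete, here is a minimal sketch in Python (not Exim code) of what decoding an RFC 2047 encoded-word to a Unicode string looks like. It uses the standard library's email.header.decode_header; the helper name decode_rfc2047 is hypothetical, and the fallback for unknown charsets is an assumption of this sketch, not a description of what expand.c does.

```python
from email.header import decode_header


def decode_rfc2047(raw_header: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string.

    Hypothetical helper for illustration only; it does not mirror Exim's
    implementation.
    """
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, bytes):
            try:
                # Use the charset named in the encoded-word; plain parts
                # have no charset and are treated as ASCII.
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Assumption: for an unknown charset name, fall back to
                # ASCII and replace whatever cannot be decoded.
                parts.append(chunk.decode("ascii", errors="replace"))
        else:
            # Headers without any encoded-word come back as str already.
            parts.append(chunk)
    return "".join(parts)


subject = decode_rfc2047("Neue =?ISO-8859-1?q?Gl=E4ser?=")
# Encoding the result as UTF-8 yields bytes that are safe for a UTF-8 database.
subject_utf8 = subject.encode("utf-8")
```

The 0xE4 byte from the ISO-8859-1 encoded-word becomes the two-byte UTF-8 sequence for "ä", which is the kind of result the question above expects from $h_Subject: .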
If the header contains non-ASCII 8-bit characters (which is illegal), I would
like exim to replace them with "?". Can this be done in the exim config, or
do we need a new expansion function for that? I must ensure valid UTF-8 at
the DB interface.

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius

--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##
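The replacement asked for above (turning every byte that is not valid UTF-8 into "?") can be sketched as follows. This is Python for illustration only, not an Exim expansion or config fragment; the function name to_valid_utf8 and its placeholder parameter are hypothetical.

```python
def to_valid_utf8(raw: bytes, placeholder: str = "?") -> str:
    """Return raw decoded as UTF-8, with invalid bytes replaced.

    Hypothetical helper: bytes that do not form valid UTF-8 are first
    decoded to U+FFFD (Python's standard "replace" behaviour) and then
    swapped for the placeholder. Note that a genuine U+FFFD already
    present in the input is replaced as well.
    """
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\ufffd", placeholder)


# A raw ISO-8859-1 "ü" (0xfc), the byte PostgreSQL complained about:
cleaned = to_valid_utf8(b"Staatstheater N\xfcrnberg")

# Valid UTF-8 input passes through unchanged:
unchanged = to_valid_utf8("Staatstheater N\u00fcrnberg".encode("utf-8"))
```

Anything returned by such a function can be re-encoded as UTF-8 without error, so an INSERT with a UTF-8 client encoding would no longer fail on the byte sequence, at the cost of losing the original character.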
