Re: [exim-dev] Bad utf-8 in pgsql lookup and mainlog

Axel Rau Sat, 20 Jul 2013 10:07:07 -0700

On 20.07.2013, at 17:25, Marcin Mirosław <[email protected]> wrote:

> On 2013-07-20 14:17, Axel Rau wrote:
>> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via 
>> lookup usually works flawlessly.
>> 
>> Occasionally PostgreSQL complains during INSERT of header items or main log 
>> events (our log host uses PostgreSQL as backend) about an invalid byte 
>> sequence, like here:
>> 
>> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" 
>> statement skipped: condition test deferred: PGSQL: query failed: ERROR:  
>> invalid byte sequence for encoding "UTF8": 0xfc
>> 
>> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid 
>> byte sequence for encoding "UTF8": 0xfc
>> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): 
>> |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3]  (PGRES_FATAL_ERROR) 
>> (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 
>> 'Staatstheater Nürnberg <[email protected]>', 'Newsletter 
>> Staatstheater Nürnberg', 'none', 'N/A'))
>> 
>> Does this come from bad encoding of original mail headers?
>> Is there an easy solution to skip bad characters before sending them to the 
>> DB?
>> 
>> In lookups/pgsql.c:258 I see:
>> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>> 
>> but I think it's not related.
> 
> Hi Axel!
> I suspect it is related. If you insert text into PostgreSQL, you
> need to know which encoding that text uses. If you know the
> inserted text is utf-8, you should set "client_encoding" to utf-8.
> But with email you never know which encoding will be used. In theory,
> headers should contain only basic ASCII characters.
> You can:
> a) reject mail with non-ASCII characters in the Subject.
> b) encode the Subject using e.g. base64, then insert it into the database.
> c) guess which encoding was used in the Subject, then set the
> "client_encoding" parameter accordingly.
> d) use the "C" collation for the given database/table in PostgreSQL. It
> allows you to insert any characters into the table, but you lose the
> ability to get tuples in your preferred charset. (E.g. you can keep text
> as utf-8 in the database, and when you set "client_encoding" to e.g.
> 8859-2 you will get the text in 8859-2. With the "C" collation, PostgreSQL
> does not convert to e.g. iso8859-2.)
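
To make options (b) and (c) concrete, here is a minimal Python sketch, not
exim code. The helper names to_utf8_guessing and to_base64 are made up for
illustration, and the raw bytes are an assumption about what the original
header may have looked like: 0xfc is "ü" in ISO-8859-1, which would explain
the byte PostgreSQL rejected.

```python
import base64


def to_utf8_guessing(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Option (c): try utf-8 first, then fall back to an assumed legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail,
        # but the text is only correct if the guess is correct.
        return raw.decode(fallback)


def to_base64(raw: bytes) -> str:
    """Option (b): keep the raw bytes, stored as ASCII-only base64 text."""
    return base64.b64encode(raw).decode("ascii")


# Assumed raw header bytes: Latin-1 encoded, hence invalid as utf-8.
raw_subject = b"Newsletter Staatstheater N\xfcrnberg"

print(to_utf8_guessing(raw_subject))
print(to_base64(raw_subject))
```

The guess in (c) can silently produce wrong characters when the real charset
is something else, while (b) is lossless but leaves the column unreadable
without decoding.
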
As exim works with utf-8 strings, my naive assumption was that a header like
        Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
(RFC 2047) will be converted to utf-8 by exim before I access it via 
$h_Subject: .
Judging by the complexity of expand.c, this seems to be the case.
Can anybody confirm this?
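
For comparison, this is the result I would expect from such a conversion,
sketched with Python's standard email.header module rather than exim's
expand.c. It only illustrates RFC 2047 decoding of the example header above;
it says nothing about what exim actually does internally.

```python
from email.header import decode_header, make_header

# The RFC 2047 encoded-word example from the message above.
raw = "Neue =?ISO-8859-1?q?Gl=E4ser?="

# decode_header splits the value into (bytes, charset) parts;
# make_header joins them back into one Unicode string.
decoded = str(make_header(decode_header(raw)))

# Encoding the decoded text gives bytes that are valid utf-8 by construction.
utf8_bytes = decoded.encode("utf-8")

print(decoded)
print(utf8_bytes)
```

Here the Latin-1 byte 0xE4 ends up as the two-byte utf-8 sequence for "ä",
which is what the DB would accept.
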


If the header contains non-ASCII 8-bit characters (= illegal), I would like 
exim to replace them with "?".
Can this be done in the exim config, or do we need a new expansion function 
for that?

I must ensure valid utf-8 at the DB interface.
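
The behaviour I have in mind, as a minimal Python sketch (sanitize_utf8 is a
hypothetical name, not an existing exim function): every byte sequence that
is not valid utf-8 becomes "?", everything else passes through unchanged.

```python
def sanitize_utf8(raw: bytes, placeholder: str = "?") -> str:
    """Return text that is always valid utf-8, with bad bytes replaced."""
    # errors="replace" substitutes U+FFFD for each undecodable sequence;
    # that marker is then swapped for the plain ASCII placeholder.
    # Caveat: a U+FFFD already present in valid input is replaced as well.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", placeholder)


# A stray Latin-1 byte (0xfc) is replaced, valid utf-8 is left alone.
print(sanitize_utf8(b"N\xfcrnberg"))
print(sanitize_utf8("Nürnberg".encode("utf-8")))
```
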

Axel
---
PGP-Key:29E99DD6  ☀ +49 151 2300 9283  ☀ computing @ chaos claudius


-- 
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim 
details at http://www.exim.org/ ##
