Emmanuel Dreyfus <m...@netbsd.org> wrote:

> I just upgraded Apache to 2.4 and RT to latest 3.8, and I get a charset
> problem: anything that enters RT through rt-mailgate is fine, but any
> non-ASCII character sent through the web interface gets corrupted: I
> get a ? in a square instead, which is usually what happens when an
> ISO-8859-1 character is mistaken for UTF-8.
> 
> Older messages from before the upgrade display correctly, hence this is
> really a problem at message POST time.

I fixed it. Replying to myself with the whole story for someone else's
future reference.

The problem was the database encoding. RT can use PostgreSQL with
encoding "UTF-8" or the default "SQL_ASCII". With the latter, PostgreSQL
does not care about encoding and just gives back the bytes it was given
without any check. The former enforces UTF-8 and can automatically
transcode if the client declares another encoding.

My RT installation had been configured with the PostgreSQL database
using "UTF-8" encoding for a while. At some point I upgraded PostgreSQL
and reloaded the data from a dump after reinitializing the database.
Since I did not check the encoding, the new database got "SQL_ASCII", a
setup where the application must take care of data encoding itself.

RT stores data as UTF-8, but it seems some conversions are missing in
the code, especially on ticket creation through the web. I did not find
where it happens, but this action was introducing ISO-8859-1 characters
into the database. After a few weeks, I had a database randomly mixing
ISO-8859-1 and UTF-8 data.
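
To illustrate how such a mix produces the "? in a square" symptom, here
is a minimal Python sketch. It is not RT code; decode_for_display() is a
hypothetical name standing in for whatever reads the stored bytes back
as UTF-8.

```python
def decode_for_display(raw: bytes) -> str:
    """Decode stored bytes as UTF-8, the way a UTF-8 web page would.

    Invalid sequences become U+FFFD, which browsers usually draw as a
    question mark in a box or diamond.
    """
    return raw.decode("utf-8", errors="replace")


# The same text stored with two different encodings (illustrative data).
stored_as_utf8 = "café".encode("utf-8")         # é is two bytes: C3 A9
stored_as_latin1 = "café".encode("iso-8859-1")  # é is one byte:  E9

print(decode_for_display(stored_as_utf8))
print(decode_for_display(stored_as_latin1))
```

With SQL_ASCII, PostgreSQL accepts both byte strings unchanged, so
nothing flags the second one until it is displayed.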

Fixing the situation required dumping the database, dropping it,
recreating it with "UTF-8" encoding, and reloading from the dump. But
the dump first had to be cleaned of every ISO-8859-1 character,
otherwise PostgreSQL could not load it.

Using iconv(1) could not help, since there were also some UTF-8
characters in the database. I had to write external C functions for
PostgreSQL to perform queries such as
update attachments set content=qpfix(content),
   contentencoding='quoted-printable' where not is_utf8(content);

is_utf8() is an external function that finds character sequences that
are invalid for UTF-8.
qpfix() is an external function that translates ISO-8859-1 into
quoted-printable UTF-8.
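
For reference, the logic of the two functions can be sketched in Python.
This is only an illustration of the idea, not the C code I used; the
repair() helper and the sample data are assumptions added for the
example.

```python
import quopri
from typing import Optional, Tuple


def is_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def qpfix(data: bytes) -> bytes:
    """Treat data as ISO-8859-1 and return it as quoted-printable UTF-8."""
    text = data.decode("iso-8859-1")  # never fails: every byte is a code point
    return quopri.encodestring(text.encode("utf-8"))


def repair(content: bytes) -> Tuple[bytes, Optional[str]]:
    """Mirror the UPDATE: rewrite only rows that are not valid UTF-8.

    Returns (new_content, new_contentencoding); None means "leave unchanged".
    """
    if is_utf8(content):
        return content, None
    return qpfix(content), "quoted-printable"


good = "café".encode("utf-8")
bad = "café".encode("iso-8859-1")

# A blind conversion of everything, as iconv -f ISO-8859-1 -t UTF-8 would
# do, double-encodes the rows that were already UTF-8.
blind = good.decode("iso-8859-1").encode("utf-8")

for row in (good, bad):
    print(repair(row))
print(blind.decode("utf-8"))
```

Note that the validity test is a heuristic: pure ASCII passes either
way, and a few ISO-8859-1 byte sequences also happen to be valid UTF-8,
so such rows are left untouched.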

That kind of fix had to be applied to various columns of the
attachments, users, and transactions tables. I can share the C code if
someone is interested.

After the proper fix, the database dump could be reimported into the
UTF-8 encoded database, and the charset trouble on ticket creation from
the web disappeared.


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
