Hi,
A couple of months ago, I made a patch against Pine 4.44 to improve its support for UTF-8 and I18N in general. I've been using it in xterm-16x under the ko_KR.UTF-8 locale to send all my outgoing messages in UTF-8 and to read incoming messages in various encodings (ISO-8859-x, Windows-125x, KOI8-R/U, ISO-2022-JP, EUC-KR, UTF-8, and so forth). About 20% of the emails to this list archived on my local machine were sent with some version of Pine, so I thought some of you might be interested in my patch.

The patch is available at

    http://jshin.net/i18n/pine4.44.iconv.patch

I have tested this only under Linux with glibc 2.2.x, but it should also work under any Unix-like OS with Bruno's libiconv, or under any other OS to which libiconv has been ported. My patch relies on the glibc/libiconv implementation of iconv(3) doing transliteration when '//TRANSLIT' is appended to an encoding name. (This dependency can be removed, but I was lazy.) The default iconv(3) on OSes like Solaris 8/9 may not have this extension, in which case it won't work with my patch. But this is the linux-utf8 list, so I guess I can get away with that dependency here.

To compile it, you have to use

    % ./build EXTRACFLAGS="-DHAVE_ICONV" target

Three configuration options are added. I got the idea for two of them from Mutt 1.4.x/1.5.x.

* assumed-charset : A lot of emails sent by non-standards-compliant MUAs/web mail programs have _raw_ 8bit characters (i.e. not encoded per RFC 2047) in the message header. Setting this to the charset most common among such emails helps you read their headers (subject, from, to, etc). For instance, Western European users would want to set this to ISO-8859-1/Windows-1252. Chinese (Simplified) users would set this to GB2312. This does NOT work yet for an _untagged_ message body (one with no MIME charset specified in the C-T header).
  For an untagged message body, you have to define the display filter for US-ASCII as one of the following:

      _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f ISO-8859-1 -t UTF-8
  or
      _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8
  or
      _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f GB2312 -t UTF-8

* charset-aliases : Some MUAs use non-standard MIME charset names. For instance, MS Outlook Express uses ks_c_5601-1987 for EUC-KR or CP949 (X-Windows-949). You can specify pairs of non-standard and standard MIME charset names, with the pairs separated by commas. Within each pair, the non-standard name and the standard name are separated by a colon. For instance, I have

      ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949

* iconv-aliases : Iconv codeset names are not standardized and are not always the same as the standard MIME charset names. For instance, 'x-windows-949' is 'mscp949' in the glibc implementation of iconv, so I have the following:

      x-windows-949:mscp949,euc-kr:mscp949

  Although EUC-KR is understood by the glibc implementation of iconv, I also have 'euc-kr:mscp949' because some emails in X-Windows-949 are MISLABELLED as EUC-KR. X-Windows-949 (CP949) is upward compatible with EUC-KR, and there's no harm in treating genuine EUC-KR text as X-Windows-949. The same goes for ISO-8859-1 and Windows-1252: 'iso-8859-1:windows-1252' may be added to work around the problem. You can get the identical effect by adding it to the charset-aliases list.

You also have to set 'character-set' to 'UTF-8' and run Pine in a UTF-8 terminal (xterm-16x, putty, Solaris dtterm under a UTF-8 locale, etc). In addition, you have to define a bunch of display filters because my patch doesn't use iconv internally to do automatic encoding/MIME charset conversion for the message body. However, it does automatic conversion for the message header.

I haven't checked yet whether the '-c' option is specified in SUS3/POSIX. It may be a glibc/libiconv extension. I have the following defined in my pinerc:

display-filters=_CHARSET(EUC-KR)_ /usr/bin/iconv -c -f EUC-KR -t UTF-8,
    _CHARSET(ks_c_5601-1987)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
    _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
    _CHARSET(ISO-8859-1)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8,
    _CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
    _CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP -t UTF-8,
    _CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312 -t UTF-8,
    _CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5 -t UTF-8,
    _CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
    _CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
    _CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
    _CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
    _CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
    _CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
    _CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
    _CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
    _CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
    _CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
    _CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
    _CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
    _CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
    _CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
    _CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
    _CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
    _CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
    _CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
    _CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
    _CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8

There are a couple of problems with my patch. One of them is that I haven't done anything to fix 'one octet -> one column width model'.
In UTF-8, this false assumption completely breaks down except for characters in US-ASCII (U+0020 - U+007E), as you are well aware. Therefore, in the message display screen, lines are wrapped prematurely, and in the message index screen, headers (subject, recipient, etc) are truncated prematurely.

The other is that somehow the link to 'email list management information' at the end of a message with a 'list management information' header does not work. I guess it's easy to fix, but I haven't gotten around to looking into it yet.

There may be other problems as well. I'll be glad to hear about them, although I may not be able to fix them as quickly as I would wish.

BTW, Pine 4.44 with my patch can also be run in a non-UTF-8 terminal. In that case, you have to set 'character-set' to the encoding of your terminal (say, EUC-JP) and define your display filters accordingly.

My goal was to make Pine a text-terminal version of MS OE or Mozilla-mail in terms of I18N support. With my patch, Pine got closer to that goal, but is still far from it. Some of the features I want to see include:

- The encoding (MIME charset) for outgoing emails should be decoupled from the encoding of the terminal under which Pine is launched.

- It should be possible to change the encoding (MIME charset) of outgoing messages _at the time of_ composition (as is possible with MS OE and Mozilla-Mail). Although going all the way to UTF-8 is desirable, the reality is that some of my correspondents cannot deal with UTF-8 messages. For them, I have to write in legacy encodings. Currently, I have to launch another Pine with a separate pinerc to compose my email in a legacy encoding.

- Internal encoding conversion with iconv (as opposed to relying on users setting display filters correctly in pinerc).

- 'assumed-charset' should be settable on a per-folder basis as well as globally.

Hope a lot of people find my patch useful,

Jungshik Shin

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
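
For illustration only, the way the charset-aliases and iconv-aliases values described in the message could be read and chained can be sketched in Python. This is not code from the patch: the function names parse_aliases and resolve are hypothetical, and the lookup order (charset-aliases first, then iconv-aliases) and the case-insensitive matching are assumptions. The message only says that putting a pair in either list has the same effect.

```python
def parse_aliases(value):
    """Parse a 'from:to,from:to' option value into a dict.

    Hypothetical helper: names are lowercased on the assumption that
    charset names are matched case-insensitively.
    """
    aliases = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        name, sep, target = pair.partition(":")
        if not sep or not name.strip() or not target.strip():
            raise ValueError(f"malformed alias pair: {pair!r}")
        aliases[name.strip().lower()] = target.strip().lower()
    return aliases


def resolve(charset, charset_aliases, iconv_aliases):
    """Map a MIME charset label to the codeset name to hand to iconv.

    Assumed order: non-standard MIME name -> standard MIME name
    (charset-aliases), then standard MIME name -> iconv codeset name
    (iconv-aliases). Unknown names pass through unchanged.
    """
    name = charset.strip().lower()
    name = charset_aliases.get(name, name)
    return iconv_aliases.get(name, name)


# The example values quoted in the message.
CHARSET_ALIASES = parse_aliases(
    "ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949")
ICONV_ALIASES = parse_aliases("x-windows-949:mscp949,euc-kr:mscp949")

if __name__ == "__main__":
    for label in ("ks_c_5601-1987", "EUC-KR", "KOI8-R"):
        print(label, "->", resolve(label, CHARSET_ALIASES, ICONV_ALIASES))
```

With those values, both the Outlook Express label ks_c_5601-1987 and a (possibly mislabelled) EUC-KR end up as mscp949, while a name that appears in neither list, such as KOI8-R, is passed through as is.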
