a patch to pine4.44 for a better UTF-8(I18N) support

Jungshik Shin Fri, 12 Jul 2002 01:56:43 -0700

Hi,


 A couple of months ago, I made a patch against Pine 4.44 to improve
its support for UTF-8 and I18N in general. I've been using it in
xterm-16x under the ko_KR.UTF-8 locale to send all my outgoing messages
in UTF-8 and to read incoming messages in various encodings (ISO-8859-x,
Windows-125x, KOI8-R/U, ISO-2022-JP, EUC-KR, UTF-8, and so forth).
About 20% of the emails to this list archived on my local machine
were sent with some version of Pine, so I thought some of you
might be interested in my patch. The patch is available
at http://jshin.net/i18n/pine4.44.iconv.patch.

  I have tested this only under Linux with glibc 2.2.x, but it should
also work under any Unix-like OS with Bruno's libiconv, or under any other
OS to which libiconv has been ported. My patch relies on the glibc/libiconv
implementation of iconv(3) performing transliteration when '//TRANSLIT' is
appended to an encoding name. (This dependency can be removed,
but I was lazy.) The default iconv(3) on OSes like Solaris 8/9 may not
have this extension, in which case it won't work with my patch. This is the
linux-utf8 list, though, so I guess I can get away with that dependency here.

 To compile it, you have to use

 %  ./build EXTRACFLAGS="-DHAVE_ICONV" target



 The patch adds three configuration options. I got the idea for
two of them from Mutt 1.4.x/1.5.x.

 * assumed-charset : A lot of emails sent by non-standards-compliant
                    MUAs/web mail programs have _raw_ 8-bit characters (i.e.
                    not encoded per RFC 2047) in the message header.
                    Setting this to the most common charset of such
                    emails helps you read their headers (subject,
                    from, to, etc). For instance, Western European
                    users would want to set this to ISO-8859-1 or
                    Windows-1252. Chinese (Simplified) users would
                    set this to GB2312. This does NOT yet work for an
                    _untagged_ message body (one with no MIME charset
                    specified in the Content-Type header). For an
                    untagged message body, you have to define the
                    display filter for US-ASCII as one of the following:

                    _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f ISO-8859-1 -t UTF-8

                        or

                    _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8

                        or

                    _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f GB2312 -t UTF-8

  * charset-aliases : Some MUAs use non-standard MIME charset names. For
                      instance, MS Outlook Express uses ks_c_5601-1987
                      for EUC-KR or CP949 (X-Windows-949). You can
                      specify pairs of non-standard and standard MIME
                      charset names, with the pairs separated by
                      commas. Within each pair, the non-standard
                      name and the standard name should be
                      separated by a colon. For instance, I have

                      ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949

  * iconv-aliases :  Iconv codeset names are not standardized
                     and are not always the same as
                     the standard MIME charset names. For instance,
                     'x-windows-949' is called 'mscp949' in the glibc
                     implementation of iconv, so I have the following:

                     x-windows-949:mscp949,euc-kr:mscp949

                     Although EUC-KR is understood by the glibc
                     implementation of iconv, I also have
                     'euc-kr:mscp949' because some emails in
                     X-Windows-949 are MISLABELLED as EUC-KR.
                     X-Windows-949 (CP949) is upward compatible with
                     EUC-KR, and there's no harm in treating genuine
                     EUC-KR text as X-Windows-949. The same holds for
                     ISO-8859-1 and Windows-1252:
                     'iso-8859-1:windows-1252' may be added to work
                     around the problem. You can get the identical
                     effect by adding it to the charset-aliases list.
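
  To illustrate how the three options described above could fit
together, here is a minimal sketch in Python (not the patch's actual C
code). The function names, the lower-casing of charset names, and the
lookup order (charset-aliases first, then iconv-aliases) are all
assumptions made for the example. 'cp1252' is Python's codec name for
Windows-1252, not an iconv name.

```python
def parse_aliases(value):
    """Parse 'from:to,from:to' into a dict with lower-cased names."""
    aliases = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, _, target = pair.partition(":")
        aliases[name.strip().lower()] = target.strip().lower()
    return aliases


def resolve_codeset(mime_charset, charset_aliases, iconv_aliases):
    """Map a MIME charset label to the codeset name to hand to iconv."""
    name = mime_charset.strip().lower()
    name = charset_aliases.get(name, name)  # non-standard -> standard MIME name
    return iconv_aliases.get(name, name)    # standard MIME name -> iconv name


def decode_raw_header(raw, assumed_charset):
    """Decode a header that may hold raw 8-bit bytes (no RFC 2047 encoding)."""
    try:
        return raw.decode("ascii")          # pure ASCII needs no guessing
    except UnicodeDecodeError:
        # Fall back to the assumed charset; undecodable bytes become U+FFFD.
        return raw.decode(assumed_charset, errors="replace")


charset_aliases = parse_aliases(
    "ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949")
iconv_aliases = parse_aliases("x-windows-949:mscp949,euc-kr:mscp949")

print(resolve_codeset("ks_c_5601-1987", charset_aliases, iconv_aliases))  # mscp949
print(resolve_codeset("EUC-KR", charset_aliases, iconv_aliases))          # mscp949
print(resolve_codeset("KOI8-R", charset_aliases, iconv_aliases))          # koi8-r

# A raw 8-bit subject line, read with assumed-charset = Windows-1252.
subject = decode_raw_header(b"Subject: caf\xe9", "cp1252")
```

  In this sketch a charset with no alias entry passes through unchanged,
so only the exceptions need to be listed.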

  You also have to set 'character-set' to 'UTF-8' and run Pine in a UTF-8
terminal (xterm-16x, PuTTY, Solaris dtterm under a UTF-8 locale, etc).

 In addition, you have to define a bunch of display filters because
my patch doesn't use iconv internally to do automatic encoding/MIME
charset conversion for the message body. However, it does automatic
conversion for the message header. I have the following defined
in my pinerc. I haven't checked yet whether the '-c' option is
specified in SUS3/POSIX. It may be a glibc/libiconv extension.

display-filters=_CHARSET(EUC-KR)_ /usr/bin/iconv -c -f EUC-KR -t UTF-8,
        _CHARSET(ks_c_5601-1987)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
        _CHARSET(US-ASCII)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
        _CHARSET(ISO-8859-1)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8,
        _CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
        _CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP  -t UTF-8,
        _CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312  -t UTF-8,
        _CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5  -t UTF-8,
        _CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
        _CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
        _CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
        _CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
        _CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
        _CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
        _CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
        _CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
        _CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
        _CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
        _CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
        _CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
        _CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
        _CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
        _CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
        _CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
        _CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
        _CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
        _CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
        _CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8
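
  Since the filter list above is so repetitive, it could be generated
rather than typed by hand. The following Python sketch is one way to do
that. The function name, its parameters, and the output indentation are
assumptions made for illustration; only the filter syntax itself is
taken from the list above.

```python
def display_filters(charsets, iconv_path="/usr/bin/iconv", target="UTF-8"):
    """Build a pinerc 'display-filters=' value.

    `charsets` is a sequence of (MIME charset, iconv codeset name) pairs.
    """
    entries = [
        f"_CHARSET({mime})_ {iconv_path} -c -f {codeset} -t {target}"
        for mime, codeset in charsets
    ]
    # One filter per line, continuation lines indented as in the pinerc.
    return "display-filters=" + ",\n        ".join(entries)


pairs = [
    ("EUC-KR", "EUC-KR"),
    ("ks_c_5601-1987", "MSCP949"),
    ("ISO-8859-1", "Windows-1252"),
    ("Windows-874", "CP874"),
]
# The ISO-8859-x family follows one naming pattern.
pairs += [(f"ISO-8859-{n}", f"ISO8859-{n}") for n in (2, 3, 4, 5)]

print(display_filters(pairs))
```

  Passing a different `target` (for example "EUC-JP") would produce
filters for a terminal that is not running in UTF-8.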


  There are a couple of problems with my patch.

    One of them is that I haven't done anything to fix the 'one octet ->
one column width' model.  In UTF-8, this false assumption completely
breaks down except for characters in US-ASCII (U+0020 - U+007E), as you
are well aware. Therefore, in the message display screen, lines are
wrapped prematurely, and in the message index screen, headers (subject,
recipient, etc) are truncated prematurely.
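
  The gap between octets and columns is easy to see with a small Python
sketch. This is only an approximation of terminal column width based on
the Unicode East Asian Width property, not what the patch does, and the
function name is made up for the example.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                  # combining marks take no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                # wide and fullwidth characters
        else:
            width += 1
    return width


samples = (
    "Pine",            # US-ASCII only
    "caf\u00e9",       # Latin letter with an acute accent
    "\ud55c\uae00",    # two Hangul syllables
)
for sample in samples:
    octets = len(sample.encode("utf-8"))
    # Prints: UTF-8 octets, characters, columns
    print(octets, len(sample), display_width(sample))
```

  The three samples give 4/4/4, 5/4/4, and 6/2/4 respectively, so code
that counts octets believes the Hangul text needs six columns when it
actually needs four, which is why lines wrap and headers get cut too early.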

  The other is that, somehow, the link to 'email list management
information' at the end of a message with a 'list management information'
header does not work. I guess it's easy to fix, but I haven't gotten
around to looking into it yet.


  There may be other problems as well. I'll be glad to hear about them,
although I may not be able to fix them as quickly as I wish to.

  BTW, Pine 4.44 with my patch can also be run under a non-UTF-8 terminal.
In that case, you have to set 'character-set' to the encoding of
your terminal (say, EUC-JP) and define your display filters accordingly.

  My goal was to make Pine a text-terminal version of MS OE or
Mozilla-mail in terms of I18N support. With my patch, Pine got
closer to that goal, but is still far from it. Some of the features
I want to see include:


  - The encoding(MIME charset) for outgoing emails should be
    decoupled from the encoding of a terminal under which Pine
    is launched.

  - It should be possible to change the encoding(MIME charset)
    of outgoing messages _at the time of_ composition
    (as is possible with MS OE and Mozilla-Mail.)
    Although going all the way to UTF-8 is desirable,
    the reality is that some of my correspondents cannot
    deal with UTF-8 messages. For them, I have to
    write in legacy encodings. Currently, I  have to
    launch another Pine with a separate pinerc to compose
    my email in a legacy encoding.

  - Internal encoding conversion with iconv (as opposed to relying on
    users setting display filters correctly in pinerc).

  - 'assumed-charset' should be settable on a per-folder basis as well as
     globally.


   Hope a lot of people find my patch useful,

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
