On August 12, 1999 at 18:18, "Peter Seitz jun." wrote:
> I am archiving a german language discussion group and there are lots
> of umlauts in these mails.
>
> I'd like to convert the umlauts into entities so these umlauts can be
> read on various platforms (windows, Macintosh) correctly. I was not
> able to find out what I have to put into my resource files.
>
> Can someone please help?
Sure. The answer differs depending on whether you are dealing
with message header data or message body data.
Header:
CHARSETCONVERTERS are invoked when non-ASCII extension encoding
is encountered in message headers, that is, the =?...?.?...?=
stuff. If the umlauts are encoded as such, you can
get the effect you want.
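
To make the header case concrete, here is a minimal sketch of the idea in
Python. It is an illustration only, not MHonArc's Perl code: the function
names to_entities and header_to_entities are hypothetical, and it emits
HTML 4 entity names, which may not match what iso_8859::str2sgml produces.

```python
from email.header import decode_header
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with a named or numeric entity reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])   # e.g. U+00FC becomes &uuml;
        else:
            out.append("&#%d;" % cp)                  # no named entity: numeric reference
    return "".join(out)


def header_to_entities(raw_header, fallback_charset="iso-8859-1"):
    """Decode =?charset?encoding?text?= words, then convert to entity references.

    fallback_charset is an assumption of this sketch, used only for byte
    chunks that carry no charset label.
    """
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or fallback_charset, errors="replace")
        parts.append(chunk)
    return to_entities("".join(parts))


print(header_to_entities("=?iso-8859-1?Q?M=FCller?="))
```

A header that contains the encoded word =?iso-8859-1?Q?M=FCller?= is first
decoded to the name with its u-umlaut and then written out as M&uuml;ller.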
By default MHonArc will convert 8-bit characters into entity
references, with the exception of iso-8859-1 character data.
The reason is that most browsers default to iso-8859-1.
To change this, have something like the following in your
resource file:
<CharsetConverters>
iso-8859-1; iso_8859::str2sgml; iso8859.pl
</CharsetConverters>
If you have a non-encoded/raw 8-bit character in the message
header, MHonArc keeps it as-is. To force a conversion to
an entity reference would require code changes to MHonArc
itself.
Body:
You'll have to tweak the text/plain filter to call
iso_8859::str2sgml when iso-8859-1 character data is
specified (it is already invoked for iso-8859-[2-10]). You will
probably also want to call iso_8859::str2sgml by default if you
know there are messages that contain 8-bit characters but do
not specify a charset parameter in the Content-Type field.
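
The selection logic for the body can be sketched roughly as follows. This
is Python written for illustration, not the actual Perl of the text/plain
filter. The name body_to_entities and its default_charset parameter are
hypothetical, and the entity names come from Python's HTML 4 table rather
than from iso8859.pl.

```python
import re
from html.entities import codepoint2name

# Matches iso-8859-1 through iso-8859-10, the charsets discussed above.
ISO_8859 = re.compile(r"^iso-8859-(?:[1-9]|10)$", re.IGNORECASE)


def body_to_entities(body, charset=None, default_charset="iso-8859-1"):
    """Convert raw body bytes to ASCII text with entity references.

    If the Content-Type field gave no charset parameter, fall back to
    default_charset (an assumption of this sketch).
    """
    if charset is None:
        charset = default_charset
    if not ISO_8859.match(charset):
        raise ValueError("charset not handled by this sketch: %s" % charset)

    # Bytes that are undefined in the charset become U+FFFD instead of raising.
    text = body.decode(charset, errors="replace")

    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


# No charset parameter given, but the body contains 8-bit characters.
print(body_to_entities(b"Gr\xfc\xdfe"))
```

Note that this sketch does not escape "<", ">" or "&" in the body, which a
real HTML-producing filter also has to do.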
I should probably modify the text/plain filter to use
the functions specified in CHARSETCONVERTERS instead
of having a hard-coded mapping. At present, CHARSETCONVERTERS
is only checked for "-decode-" settings.
Note, the use of iso_8859::str2sgml does incur a performance
penalty. See
<http://www.xray.mpe.mpg.de/mailing-lists/mhonarc/1998-02/msg00083.html>
(message-id <[EMAIL PROTECTED]>) for
more information.
--ewh