Hi,

From: Tomohiro KUBOTA <[EMAIL PROTECTED]>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Fri, 03 Jan 2003 09:06:43 +0900 (JST)

> BTW, I found similar trouble in lists.debian.org pages. In thread-list
> pages or date-list pages like
>
> http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html,
>
> there is no charset specification. In such cases, web browsers will
> choose an encoding for these pages according to user preference.
> Naturally, Japanese people configure web browsers to "assume Japanese
> encoding for pages without charset specification". On the other hand,
> the thread-list pages show senders' names in <em> format, and therefore
> a tag </em> follows the name. If the last letter of the name is 8bit,
> the tag is broken. The result is that all following text is shown in
> <em> (italic) format.
>
> The test is easy: please configure your browser to "assume Japanese
> encoding for pages without charset specification" and load the above
> page.
>
>
> However, in this case, the solution is a bit complicated. All mails
> should have encoding information in MIME format. Thus, the best
> solution would be to parse MIME. On the other hand, the simplest
> makeshift solution is to add "charset=iso8859-1" for all pages,
> but there are mailing lists where most of the 8bit characters are
> Cyrillic and so on.

I found that MHonArc has a feature to solve this problem.

http://www.mhonarc.org/MHonArc/doc/faq/mime.html#nonascii

I checked /org/lists.debian.org/mhonarc/debian.rc and found that it
seems to assume that any 8bit characters are ISO-8859-1.

> <CharsetConverters>
> plain; mhonarc::htmlize;
> us-ascii; mhonarc::htmlize;
> iso-8859-1; mhonarc::htmlize;
> iso-8859-2; iso_8859::str2sgml; iso8859.pl
> iso-8859-3; iso_8859::str2sgml; iso8859.pl

Why not use iso_8859::str2sgml instead of mhonarc::htmlize for
iso-8859-1? (Though I am new to MHonArc, I imagine that
iso_8859::str2sgml converts ISO-8859 8bit characters into SGML
entities like "&ouml;".)
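
To make the idea concrete, here is a minimal sketch in Python of the kind of conversion that iso_8859::str2sgml is imagined to perform above. It is not MHonArc's code (MHonArc is written in Perl), the function name latin1_to_entities is hypothetical, and it simply assumes the input bytes are ISO-8859-1. Every non-ASCII byte becomes an entity, so the output is pure ASCII and a following tag such as </em> stays intact whatever encoding the browser assumes.

```python
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Return ASCII-only HTML text for bytes ASSUMED to be ISO-8859-1.

    Hypothetical helper for illustration; not part of MHonArc.
    """
    out = []
    # latin-1 maps every byte 0x00-0xFF to the code point of the same
    # value, so this decode step cannot fail.
    for ch in raw.decode("latin-1"):
        cp = ord(ch)
        if ch in "<>&":
            # Escape markup-significant ASCII characters.
            out.append(f"&{codepoint2name[cp]};")
        elif cp < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # Named entity, e.g. 0xF6 becomes &ouml;
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No named entity (0x80-0x9F): use a numeric reference.
            out.append(f"&#{cp};")
    return "".join(out)


if __name__ == "__main__":
    print(latin1_to_entities(b"J\xf6rg"))
```

For example, latin1_to_entities(b"J\xf6rg") gives "J&ouml;rg". A sender name written in another 8bit charset (KOI8-R, say) would still come out as ASCII, but as the wrong Latin-1 entities.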

It would be nice if we could convert raw 8bit mail headers to SGML
entities by assuming they are ISO-8859-1. (Raw 8bit headers are
illegal, but they sometimes occur and can break the lists.debian.org
pages.) Since this may annoy Russian (and other non-ISO-8859-1) users
who happen to use MUAs that generate illegal mail headers with 8bit
characters and no charset specification, I'd like to hear from people
from various countries.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/

