On Sat, 13 Apr 2002, Glenn Maynard wrote:

> On Fri, Apr 05, 2002 at 09:14:15PM +0100, Markus Kuhn wrote:
> > When I enter a Unicode character (Mozilla 0.9.9 nicely supports UTF-8
> > cut&paste from xterm) into a bugzilla bug description, then the resulting
> > web page shows these characters as human-readable numeric character
> > references. Example:
> >
> > http://bugzilla.mozilla.org/show_bug.cgi?id=135762
> >
> > What exactly do the W3C standards say about how Unicode characters
> > entered into form fields are supposed to be submitted by the HTTP
> > client to the server.

Here's a detailed discussion of the issue. It's a very complicated
problem, partly because the standard came late and various practices
had been in use before it:

  http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

It seems that Mozilla's behavior is similar to that of MS IE. When
characters outside the repertoire of the current encoding are entered
into a form field, they're turned into NCRs (numeric character
references) before being sent to the server. For instance, if the
current encoding of your browser is set to ISO-8859-1 and you cut &
paste UTF-8 text containing characters outside the Latin-1 repertoire
into a form field, those not covered by Latin-1 are converted to NCRs
before being handed over to the server. The euro sign and the 'right
double quotation mark' (not in ISO-8859-1 but in CP1252) don't turn
into NCRs because Mozilla treats ISO-8859-1 and CP1252 identically.
You wouldn't have had the problem if you had set your browser (Mozilla)
encoding to UTF-8 when you cut & pasted the UTF-8 text into the form
field.

> This doesn't answer your question, but it's relevant: IE5 has an option
> in its configuration, "always send URLs as UTF-8". It defaults on. I
> don't know what it does when this is turned off, and I don't know if
> either mode is standards-conformant.

Actually, that's a different issue. Some HTML documents have URLs
embedded in the encoding of the document itself. For example, a
Japanese HTML document in EUC-JP can have

  <a href="http://www.xyz.co.jp/..../file name1 in EUC-JP">Link1</a>
  <a href="http://www.xyz.co.jp/..../file name2 in EUC-JP">Link2</a>

When that option is turned on, MS IE converts 'file name1 in EUC-JP'
to UTF-8 and then URL-encodes (%hh%hh) the result before sending the
request to the server. To serve the request, the server has to
URL-decode it (as all servers do anyway) and then convert the UTF-8
back to the local encoding (which could be EUC-JP, Shift_JIS or even
UTF-8). A lot of web pages in Korea and Japan have this kind of
embedded URL in the encoding of the document (EUC-JP, Shift_JIS,
EUC-KR, etc.).
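
To make the two conversions above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not what Mozilla, MS IE or any server module actually runs: the function names (form_field_bytes, browser_escape, server_local_name) are made up for this example, and the codec names are simply the ones Python happens to use.

```python
from urllib.parse import quote, unquote_to_bytes


def form_field_bytes(text, page_encoding):
    """Form submission: characters outside page_encoding become NCRs.

    Hypothetical helper; Python's "xmlcharrefreplace" error handler
    emits decimal references such as &#8364; for unencodable characters.
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


def browser_escape(href_bytes, doc_encoding):
    """Browser side: document encoding -> UTF-8 -> %hh escapes."""
    text = href_bytes.decode(doc_encoding)
    return quote(text.encode("utf-8"))


def server_local_name(escaped, fs_encoding):
    """Server side: %hh escapes -> UTF-8 -> local file system encoding."""
    text = unquote_to_bytes(escaped).decode("utf-8")
    return text.encode(fs_encoding)


if __name__ == "__main__":
    # The euro sign is not in ISO-8859-1, so it is sent as an NCR ...
    print(form_field_bytes("caf\u00e9 \u20ac", "iso-8859-1"))
    # ... but it is in CP1252, so no NCR is needed there.
    print(form_field_bytes("caf\u00e9 \u20ac", "cp1252"))

    # A file name stored in an EUC-JP document, served from a
    # Shift_JIS file system (both encodings chosen for illustration).
    raw = "\u65e5\u672c".encode("euc_jp")
    escaped = browser_escape(raw, "euc_jp")
    print(escaped)
    print(server_local_name(escaped, "shift_jis"))
```

Note that the server side never needs to know the encoding of the document that contained the link; it only has to know that the escaped bytes are UTF-8 and what its own file system uses.
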
Because a URL itself cannot carry any information about the encoding
used, it's reasonable to assume that embedded URLs are in the encoding
of the HTML file that contains them, and to convert them to UTF-8
before URL-encoding. Please note that URLs in one document can point
to documents on another server, so we cannot assume that the receiving
side knows which encoding was used in the URL request. If the web
server serving the document doesn't do the second step (converting
UTF-8 back to the local encoding), the request cannot be fulfilled
unless the local file system uses UTF-8. There are a couple of Apache
modules that take care of this step of converting UTF-8 back to the
encoding of the local file system.

Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
