Re: Entering Unicode characters into bugzilla and other web forms

jshin Sat, 13 Apr 2002 13:44:11 -0700

On Sat, 13 Apr 2002, Glenn Maynard wrote:

> On Fri, Apr 05, 2002 at 09:14:15PM +0100, Markus Kuhn wrote:
> > When I enter a Unicode character (Mozilla 0.9.9 nicely supports UTF-8
> > cut&paste from xterm) into a bugzilla bug description, then the resulting
> > web page shows these characters as human-readable numeric character
> > references. Example:
> > 
> >   http://bugzilla.mozilla.org/show_bug.cgi?id=135762
> > 
> > What exactly do the W3C standards say about how Unicode characters
> > entered into form fields are supposed to be submitted by the HTTP
> > client to the server?


  Here's a detailed discussion of the issue. It's a complicated
problem, partly because the standard came late and various
practices were already in use by then.

  http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

It seems that Mozilla's behavior is similar to that of MS IE.  When
characters outside the repertoire of the current encoding are entered into
a form field, they're turned into NCRs (numeric character references)
before being sent to the server.  For instance, if the current encoding
of your browser is set to ISO-8859-1 and you cut & paste UTF-8 text
containing characters outside the Latin-1 repertoire into a form field,
the characters not covered by Latin-1 are converted to NCRs before being
handed over to the server.

  The euro sign and the right double quotation mark (not in
ISO-8859-1 but in CP1252) don't turn into NCRs because Mozilla treats
ISO-8859-1 and CP1252 identically.
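
  As a rough illustration of the behavior described above (not
Mozilla's actual code), Python's "xmlcharrefreplace" error handler does
the same kind of substitution: characters the target encoding can't
represent become NCRs. The function name encode_form_value is made up
for this sketch.

```python
def encode_form_value(text: str, page_encoding: str) -> bytes:
    """Encode form text the way the post describes: characters the page
    encoding cannot represent are replaced by &#NNNN; references.
    (Hypothetical helper, for illustration only.)"""
    return text.encode(page_encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9 \u4e2d"  # e-acute is in Latin-1, U+4E2D is not

# Latin-1 page: the CJK character becomes an NCR.
print(encode_form_value(sample, "iso-8859-1"))

# UTF-8 page: everything is representable, so no NCRs appear.
print(encode_form_value(sample, "utf-8"))

# The euro sign is outside ISO-8859-1 but inside CP1252, so treating a
# Latin-1 page as CP1252 avoids the NCR for it.
print(encode_form_value("\u20ac", "iso-8859-1"))
print(encode_form_value("\u20ac", "cp1252"))
```

The server only sees the resulting bytes, so a literal "&#20013;" typed
by the user and an NCR produced by the browser look the same to it.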

  You wouldn't have had the problem if you had set your browser's
(Mozilla's) encoding to UTF-8 before cutting and pasting UTF-8 text into
the form field.

> This doesn't answer your question, but it's relevant: IE5 has an option
> in its configuration, "always send URLs as UTF-8".  It defaults on.  I
> don't know what it does when this is turned off, and I don't know if
> either mode is standards-conformant.

  Actually, that's a different issue. Some HTML documents contain
URLs written in the encoding of the document itself. For example, a
Japanese HTML document in EUC-JP can have

  <a href="http://www.xyz.co.jp/..../file name1 in EUC-JP">Link1</a>
  <a href="http://www.xyz.co.jp/..../file name2 in EUC-JP">Link2</a>

When that option is turned on, MS IE converts 'file name1 in EUC-JP' to
UTF-8 and then URL-encodes (%hh%hh) the result before sending the request
to the server. To serve the request, the server has to URL-decode it (as
all servers do anyway) and then convert the UTF-8 back to the local
encoding (which could be EUC-JP, Shift_JIS or even UTF-8).  A lot of web
pages in Korea and Japan have this kind of embedded URL in the encoding
of the document (EUC-JP, Shift_JIS, EUC-KR, etc.). Because a URL itself
cannot carry any information about the encoding used, it's reasonable to
assume that it's in the encoding of the HTML file that contains it and to
convert it to UTF-8 before URL-encoding it.  Please note that URLs in one
document can point to documents on another server, so we cannot assume
that the receiving side knows which encoding was used in the URL request.
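
  The client-side steps above can be sketched in Python as follows.
This is only an illustration of the idea, not IE's implementation; the
function name href_to_request_path and the sample file name are made up.

```python
from urllib.parse import quote


def href_to_request_path(raw_path: bytes, document_encoding: str) -> str:
    """Take the path bytes as they appear in the HTML file, interpret
    them in the document's encoding, then re-encode as UTF-8 and
    percent-encode. (Hypothetical helper, for illustration only.)"""
    text = raw_path.decode(document_encoding)
    # quote() encodes non-ASCII characters as UTF-8 before escaping them.
    return quote(text, safe="/", encoding="utf-8")


# A made-up Japanese file name as it would be stored in an EUC-JP page.
raw = "/\u65e5\u672c\u8a9e.html".encode("euc_jp")
print(href_to_request_path(raw, "euc_jp"))
# prints /%E6%97%A5%E6%9C%AC%E8%AA%9E.html
```

The request path comes out the same whether the page was written in
EUC-JP or Shift_JIS, which is the point of converting to UTF-8 first.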

  If the web server serving the document doesn't do the second
step (converting UTF-8 back to the local encoding), the request cannot be
fulfilled unless the local file system uses UTF-8. There are a couple of
Apache modules that take care of this step of converting UTF-8 back to
the encoding of the local file system.
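
  The server-side step might look roughly like this in Python. It is a
sketch of the idea only and says nothing about how those Apache modules
are actually written; request_path_to_local is a made-up name.

```python
from urllib.parse import unquote_to_bytes


def request_path_to_local(path: str, fs_encoding: str) -> bytes:
    """URL-decode the request path, interpret the result as UTF-8, and
    re-encode it in the file system's encoding.
    (Hypothetical helper, for illustration only.)"""
    utf8_bytes = unquote_to_bytes(path)  # step 1: undo %hh escaping
    return utf8_bytes.decode("utf-8").encode(fs_encoding)  # step 2


requested = "/%E6%97%A5%E6%9C%AC%E8%AA%9E.html"

# On a server whose file names are stored in EUC-JP:
print(request_path_to_local(requested, "euc_jp"))

# If the file system already uses UTF-8, step 2 changes nothing:
print(request_path_to_local(requested, "utf-8"))
```

A real server would also have to decide what to do when the decoded
bytes are not valid UTF-8, for example when an older client sends the
path in the document's legacy encoding.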

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
