On 04/05/2002 10:14 PM, Markus Kuhn wrote:
> When I enter a Unicode character (Mozilla 0.9.9 nicely supports UTF-8
> cut&paste from xterm) into a bugzilla bug description, then the resulting
> web page shows these characters as human-readable numeric character
> references. Example:
>
> http://bugzilla.mozilla.org/show_bug.cgi?id=135762
>
> What exactly do the W3C standards say about how Unicode characters
> entered into form fields are supposed to be submitted by the HTTP
> client to the server?
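A concrete sketch of what goes on the wire may help here (this is my own
illustration, assuming Python, UTF-8 as the form's character encoding, and a
made-up field name "comment"): first a value percent-encoded as UTF-8 bytes in
an "application/x-www-form-urlencoded" submission, then a hand-built
"multipart/form-data" part that carries its own "charset" parameter.

```python
# Illustration only, not from the spec: how a user agent could transmit
# a non-ASCII form value. Assumes UTF-8 as the form charset and a
# hypothetical field name "comment".
from urllib.parse import urlencode

# 1) application/x-www-form-urlencoded: non-ASCII characters survive only
#    as percent-encoded bytes of some charset (here UTF-8), and nothing in
#    the request says which charset was used.
query = urlencode({"comment": "Gr\u00fc\u00dfe"})  # "Grüße"
print(query)  # comment=Gr%C3%BC%C3%9Fe

# 2) multipart/form-data: each part can label its own payload with a
#    Content-Type that includes an explicit charset parameter.
boundary = "----illustrative-boundary"  # hypothetical boundary string
body = (
    f"--{boundary}\r\n"
    'Content-Disposition: form-data; name="comment"\r\n'
    "Content-Type: text/plain; charset=UTF-8\r\n"
    "\r\n"
    "Gr\u00fc\u00dfe\r\n"
    f"--{boundary}--\r\n"
).encode("utf-8")
print(body)
```

The urlencoded form is ambiguous on the receiving side (the server must guess
the charset behind the percent-encoded bytes), whereas the multipart part is
self-describing.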
Just had a look into the HTML 4 spec, chapter "Processing form data"
(http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4):

" * If the method is "get" and the action is an HTTP URI, the user agent
    takes the value of action, appends a `?' to it, then appends the form
    data set, encoded using the "application/x-www-form-urlencoded" content
    type. The user agent then traverses the link to this URI. In this
    scenario, form data are restricted to ASCII codes.

  * If the method is "post" and the action is an HTTP URI, the user agent
    conducts an HTTP "post" transaction using the value of the action
    attribute and a message created according to the content type specified
    by the enctype attribute."

Sounds to me like: whenever a form is submitted by "GET", the character set
is restricted to ASCII, so no Unicode is possible here. And it looks like
this is true whenever data is appended as a query string to the URL using
the encoding "application/x-www-form-urlencoded", even in POST requests.

POSTing anything other than ASCII character data is only possible using the
MIME type "multipart/form-data", and the user agent has to specify a
"Content-Type" including the "charset" parameter for the data of each form
field. (In practice, Lynx is the only browser I have seen so far that
really adds the "charset" parameter to the data of the form fields.)

Christoph

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
