DO NOT REPLY [Bug 34985] - utf8 to ucs2 conversion failed on Windows

bugzilla Sat, 04 Nov 2006 12:14:26 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=34985>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.


http://issues.apache.org/bugzilla/show_bug.cgi?id=34985





------- Additional Comments From [EMAIL PROTECTED]  2006-11-04 12:14 -------
Not sure what you mean by security implications, but I don't think that falling
back to another encoding such as ISO-8859-1 is necessary.

Take TWiki as an example. It uses paths like /bin/view/Main/WebHome, where
view is the CGI script and /Main/WebHome is the PATH_INFO (see
http://twiki.org/cgi-bin/viewfile/Support/ApacheErrorsDuringEdit?rev=1.1;filename=testenv.htm
for an example of CGI environment variables). To handle non-UTF-8 encodings
such as ISO-8859-1 (which Firefox currently uses for POST), it would be useful
to specify the following handling per variable:

AUTH_TYPE               Raw
DOCUMENT_ROOT           Convert
GATEWAY_INTERFACE       Raw
HTTP_ACCEPT             Raw
HTTP_ACCEPT_CHARSET     Raw
HTTP_ACCEPT_ENCODING    Raw
HTTP_ACCEPT_LANGUAGE    Raw
HTTP_CONNECTION         Raw
HTTP_HOST               Raw
HTTP_KEEP_ALIVE         Raw
HTTP_USER_AGENT         Raw
PATH                    Convert (since it has pathnames)
QUERY_STRING            Raw (not a filename, should be interpreted by application)
REMOTE_ADDR             Raw
REMOTE_PORT             Raw
REMOTE_USER             Raw
REQUEST_METHOD          Raw
REQUEST_URI             Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_FILENAME         Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_NAME             Convert if valid UTF-8 (and not overlong encoding)
SERVER_ADDR             Raw
SERVER_ADMIN            Raw
....
(rest are all raw)

Basically, only those variables that correspond to filenames should be
converted, and then only if they are valid UTF-8 without overlong encoding.
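
To make the "valid UTF-8 without overlong encoding" test concrete, here is a
minimal sketch in Python. It is not Apache code; the function name
convert_if_valid_utf8 is made up for illustration. It relies on Python's
built-in UTF-8 decoder being strict, so overlong forms such as 0xC0 0xAF
(an overlong "/") are rejected along with truncated or stray bytes.

```python
from typing import Optional


def convert_if_valid_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw is strictly valid UTF-8, else None.

    Hypothetical helper: None means "do not convert, leave the value raw".
    """
    try:
        # The strict decoder rejects overlong encodings, truncated
        # sequences and bytes that can never appear in UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # Valid UTF-8 path: would be converted.
    print(convert_if_valid_utf8("/Main/Caf\u00e9".encode("utf-8")))
    # ISO-8859-1 bytes: not valid UTF-8, left alone.
    print(convert_if_valid_utf8(b"/Main/Caf\xe9"))
    # Overlong encoding of "/": rejected.
    print(convert_if_valid_utf8(b"/Main\xc0\xafWebHome"))
```

A value that comes back as None would simply be passed through untouched, so
the application still sees the original bytes.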

Any variables not used by Apache should not be converted, but left to the
application, or a suitable add-on Apache module for conversion.

TWiki has done its own interpretation of UTF-8 URLs, independent of the OS it is
running on, which is based on a technique used by IBM's web server for mainframe
(z/OS) - basically it tries to recognise the URL as UTF-8 and then falls back to
the native encoding (i.e. no conversion done at all).  In fact we do this on the
PATH_INFO ourselves.
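
As a rough illustration of that "try UTF-8, else fall back to native"
approach, here is a small Python sketch. It is not TWiki's actual code (TWiki
is written in Perl); the function name interpret_path_info and the default
native encoding of ISO-8859-1 are assumptions made for the example.

```python
from typing import Tuple


def interpret_path_info(raw: bytes,
                        native_encoding: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for a PATH_INFO byte string.

    Hypothetical helper: the bytes are first tried as strict UTF-8; if that
    fails they are taken to be in the site's native encoding already.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8, so no UTF-8 conversion is attempted: the bytes are
        # read as-is in the native encoding.
        return raw.decode(native_encoding), native_encoding


if __name__ == "__main__":
    for sample in ("/Main/Caf\u00e9".encode("utf-8"),
                   "/Main/Caf\u00e9".encode("iso-8859-1")):
        print(sample, interpret_path_info(sample))
```

Pure ASCII paths decode identically either way, so the fallback only matters
for bytes above 0x7F. With a native encoding other than ISO-8859-1 the second
decode could itself fail, which this sketch does not handle.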

If Apache is going to carry on doing its own UTF-8 to UCS-2 conversion, which I
suppose it must do in some cases that map onto a Windows filesystem (and others
such as Mac OS X HFS+), it would be good if it recognised when data really is
UTF-8 in this way.  Also, it would be very helpful to have a configuration
option that lets you say "don't convert variable X if it matches regex Y", e.g.
don't convert PATH_INFO if it matches "/twiki/bin/.*"
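
To show the kind of rule I mean, here is a sketch in Python. No such Apache
option exists as far as I know; the rule list NO_CONVERT_RULES and the
function should_convert are purely hypothetical, and a real implementation
would live in Apache's configuration handling rather than in a script.

```python
import re

# Hypothetical rule set: (variable name, pattern). A variable whose value
# matches its pattern in full is exempt from conversion.
NO_CONVERT_RULES = [
    ("PATH_INFO", re.compile(r"/twiki/bin/.*")),
]


def should_convert(name: str, value: str, rules=NO_CONVERT_RULES) -> bool:
    """Return False if some rule exempts this variable/value pair."""
    for rule_name, pattern in rules:
        if rule_name == name and pattern.fullmatch(value):
            return False
    return True


if __name__ == "__main__":
    print(should_convert("PATH_INFO", "/twiki/bin/view/Main/WebHome"))
    print(should_convert("PATH_INFO", "/other/page"))
    print(should_convert("SCRIPT_NAME", "/twiki/bin/view"))
```

Rules are keyed on the variable name, so exempting PATH_INFO does not affect
how SCRIPT_NAME or any other variable is handled.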

Some TWiki pages that might be of interest here are:

http://twiki.org/cgi-bin/view/Codev/EncodeURLsWithUTF8 - how TWiki does
auto-detection and conversion of UTF-8 encoding for PATH_INFO in URLs

http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 - includes material
on character set auto-detection including excerpt on IBM web server approach -
fortunately UTF-8 detection is much easier than the general case.

http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N - talks
about a filesystem-related issue with Unicode normalisation forms on Mac OS X 

http://twiki.org/cgi-bin/view/Codev/ProposedUTF8SupportForI18N - general page
summarising research on UTF-8 for TWiki, including some useful links







-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
