On 12/26/2009 02:29 PM, Vincent Massol wrote:
>
> On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
>
>> Hi devs,
>>
>> The short version:
>>
>> Should we always use UTF-8 for encoding and decoding URLs, regardless of
>> the wiki encoding, for better compliance with web standards?
>>
>>
>> The long version:
>>
>> By definition, URLs can only contain ASCII characters, everything else
>> must be converted to their corresponding bytes and escaped as %XY
>> escapes. The problem is that "their corresponding bytes" implies a
>> charset + encoding, and no specification *enforces* a specific pair,
>> although it is *recommended* to use Unicode + UTF8, to comply with the
>> modern tendency of the web in general.
>>
>> Traditionally, XWiki has been using the same encoding as the configured
>> global wiki encoding for the URLs, which means that before 1.9, when we
>> switched to UTF8 as the default wiki encoding, all URLs were using the
>> ISO-8859-1 encoding. Since the switch to UTF-8, URLs are also using the
>> UTF-8 encoding by default, although the wiki encoding can be changed.
>>
>> Now, since 2.1, a bugfix accidentally changed the behavior, so that
>> parsing back URLs always uses the UTF-8 encoding, even though composing
>> URLs continues to use the wiki encoding. This is a bug, which prevents
>> changing the encoding to anything other than UTF-8, and it should be
>> fixed.
>>
>> Now, we have two options:
>>
>> 1. Reintroduce the old behavior, so that URLs always use the wiki
>> encoding. This is a direct bugfix.
>> 2. Also change the encoding part, so that UTF-8 is always used. This is
>> an improvement, going towards better compliance with web standards.
>>
>> Personally I think that the second option is the better one, but it
>> requires a vote, since it has a few drawbacks.
>>
>> Advantages:
>> + better compliance with web standards, since UTF-8 is the recommended
>> encoding for URLs (although not imposed)
>
> Is there any reference to this? Some RFC that we could quote in the
> code?

The URL RFC predates the wide adoption of UTF-8, so it does not mention any
encoding (see http://tools.ietf.org/html/rfc1738#section-2.2 ). This is why
I said that there's no enforcement. However, the URI RFC, which is a
generalization of URLs, enforces UTF-8 (see
http://tools.ietf.org/html/rfc3986#section-2.5 ). The URI RFC officially
*updates* the URL RFC, so we can say that URLs are currently standardized by
the new, UTF-8 enforcing RFC 3986, although for backwards compatibility URLs
can still be used with the RFC 1738 definition.

>> + support for a wider range of document names, since UTF-8 allows
>> full-unicode document names, while ISO-8859-1 limits names to latin1
>> characters
>> + better support from browsers, since entering accented characters
>> directly in the address bar encodes the URL sent to the server using
>> UTF-8,
>
> Is that true for all browsers? Is there a standard?

It is true for FF, Opera and Chrome. Konqueror even displays %E9 as an
invalid UTF-8 character <?>, so it assumes even more strongly that the URL
is in UTF-8. IE6 does not automatically convert %XY escapes to their
equivalent character, so it displays both %E9 and %C3%A9 (the UTF-8
encoding of é) unchanged. However, entering é in the address bar sends it
as UTF-8 bytes. Also note that IE6 predates RFC 3986 by several years, so
it has the right not to assume UTF-8 in URLs.

>> and decoding the URL also assumes UTF-8; this means that a
>> document named "é" will be printed as .../view/Main/%E9 and will have to
>> be entered the same way in the address bar when ISO-8859-1 is used, and
>> as .../view/Main/é when UTF-8 is used
>>
>> Drawbacks:
>> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
>> Tomcat configuration will have to be changed as in
>> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
>
> Any known issue logged against Tomcat? Any planned fix? If this is
> really a web standard why wouldn't Tomcat change this behavior?

According to http://wiki.apache.org/tomcat/Tomcat/UTF-8 , the Tomcat
developers interpret RFC 2616 (the HTTP 1.1 specification) as recommending
the ISO-8859-1 charset as the default. However, this is not true. The RFC
references the URI RFC as the standard for the requested address, and
ISO-8859-1 is the default for the response *body* only. The original URI RFC
referenced by the HTTP RFC, http://www.ietf.org/rfc/rfc2396.txt , does not
recommend a default charset+encoding, although it only mentions UTF-8 as a
possible encoding, and doesn't mention ISO-8859-1 at all. So I'd say that
the HTTP 1.1 specification does not recommend ISO-8859-1, or any other
default encoding, for addresses, and that it hints towards UTF-8. Moreover,
RFC 2396 has been obsoleted by RFC 3986, which clearly states that UTF-8 is
used. Perhaps I should send this interpretation to the Tomcat guys.

>> - some existing bookmarks will not work anymore once the encoding is
>> changed
>>
>> +1 for option 2 from me,
>
> Before voting I'd like to see how "standard" this is. If it's really
> standard and we have "official" standard docs to back this then I
> agree that we should choose 2.

IMO, RFC 3986 is the current standard and the one we should follow, and it
does specify UTF-8 as the ONLY encoding, not just the default.

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
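
For readers who want to see the %E9 versus %C3%A9 difference from this thread concretely, here is a minimal sketch. It is plain Python using the standard urllib.parse module, not XWiki or Tomcat code, and the helper names compose() and parse() are made up for illustration. It only shows what happens when a URL is composed with one charset and parsed with another, which is the mismatch described for 2.1.

```python
from urllib.parse import quote, unquote

NAME = "\u00e9"  # "é", the example document name used in this thread


def compose(name: str, charset: str) -> str:
    """Percent-encode a document name using the given charset (hypothetical helper)."""
    return quote(name, safe="", encoding=charset)


def parse(escaped: str, charset: str) -> str:
    """Decode %XY escapes with the given charset; undecodable bytes become U+FFFD."""
    return unquote(escaped, encoding=charset, errors="replace")


latin1_url = compose(NAME, "iso-8859-1")  # one byte, 0xE9
utf8_url = compose(NAME, "utf-8")         # two bytes, 0xC3 0xA9

# Composed with the wiki encoding (ISO-8859-1) but parsed as UTF-8:
# the lone byte 0xE9 is not valid UTF-8, so the name cannot be recovered.
mismatch = parse(latin1_url, "utf-8")

# Composed and parsed with the same charset: the name round-trips.
round_trip = parse(utf8_url, "utf-8")

print(latin1_url, utf8_url, repr(mismatch), round_trip)
```

Under these assumptions, the ISO-8859-1 form decodes to the replacement character when read as UTF-8, which is consistent with a wiki configured for ISO-8859-1 no longer resolving such document names once parsing always uses UTF-8.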

