On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:

> Hi devs,
>
> The short version:
>
> Should we always use UTF-8 for encoding and decoding URLs, regardless of
> the wiki encoding, for better compliance with web standards?
>
>
> The long version:
>
> By definition, URLs can only contain ASCII characters, everything else
> must be converted to their corresponding bytes and escaped as %XY
> escapes. The problem is that "their corresponding bytes" implies a
> charset + encoding, and no specification *enforces* a specific pair,
> although it is *recommended* to use Unicode + UTF8, to comply with the
> modern tendency of the web in general.
>
> Traditionally, XWiki has been using the same encoding as the configured
> global wiki encoding for the URLs, which means that before 1.9, when we
> switched to UTF8 as the default wiki encoding, all URLs were using the
> ISO-8859-1 encoding. Since the switch to UTF-8, URLs are also using the
> UTF-8 encoding by default, although the wiki encoding can be changed.
>
> Now, since 2.1, a bugfix accidentally changed the behavior, so that
> parsing back URLs always uses the UTF-8 encoding, even though composing
> URLs continues to use the wiki encoding. This is a bug, which prevents
> changing the encoding to anything other than UTF-8, and it should be
> fixed.
>
> Now, we have two options:
>
> 1. Reintroduce the old behavior, so that URLs always use the wiki
> encoding. This is a direct bugfix.
> 2. Also change the encoding part, so that UTF-8 is always used. This is
> an improvement, going towards better compliance with web standards.
>
> Personally I think that the second option is the better one, but it
> requires a vote, since it has a few drawbacks.
>
> Advantages:
> + better compliance with web standards, since UTF-8 is the recommended
> encoding for URLs (although not imposed)

Is there any reference to this? Some RFC that we could quote in the code?

> + support for a wider range of document names, since UTF-8 allows
> full-unicode document names, while ISO-8859-1 limits names to latin1
> characters
> + better support from browsers, since entering accented characters
> directly in the address bar encodes the URL sent to the server using
> UTF-8,

Is that true for all browsers? Is there a standard?

> and decoding the URL also assumes UTF-8; this means that a
> document named "é" will be printed as .../view/Main/%E9 and will have to
> be entered the same way in the address bar when ISO-8859-1 is used, and
> as .../view/Main/é when UTF-8 is used
>
> Drawbacks:
> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
> Tomcat configuration will have to be changed as in
> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat

Any known issue logged against Tomcat? Any planned fix? If this is really
a web standard why wouldn't Tomcat change this behavior?

> - some existing bookmarks will not work anymore once the encoding is
> changed
>
> +1 for option 2 from me,

Before voting I'd like to see how "standard" this is. If it's really
standard and we have "official" standard docs to back this then I agree
that we should choose 2.

Thanks
-Vincent
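
For reference, the encoding mismatch discussed in this thread can be reproduced outside XWiki. The following is only a minimal sketch, not XWiki's actual URL code: the class and method names are hypothetical, and it assumes Java 10 or later for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`. It shows that a document named "é" is escaped differently depending on the charset, and what happens when one side composes with UTF-8 while the other parses with ISO-8859-1.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of XWiki).
class UrlEncodingSketch {

    // Percent-encode a document name using the given charset.
    static String encode(String name, Charset charset) {
        return URLEncoder.encode(name, charset);
    }

    // Decode a percent-encoded URL segment using the given charset.
    static String decode(String segment, Charset charset) {
        return URLDecoder.decode(segment, charset);
    }

    public static void main(String[] args) {
        String name = "\u00e9"; // the character "é"

        // ISO-8859-1 maps "é" to the single byte 0xE9.
        System.out.println(encode(name, StandardCharsets.ISO_8859_1)); // %E9

        // UTF-8 maps "é" to the two bytes 0xC3 0xA9.
        System.out.println(encode(name, StandardCharsets.UTF_8)); // %C3%A9

        // Composing with one charset and parsing with another does not
        // round-trip: the two UTF-8 bytes are read as two Latin-1 characters.
        String composed = encode(name, StandardCharsets.UTF_8);
        String parsed = decode(composed, StandardCharsets.ISO_8859_1);
        System.out.println(parsed.equals(name)); // false
    }
}
```

The round trip only succeeds when both directions agree on the charset, which is why either option under vote has to keep composing and parsing consistent.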

