On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:

> Hi devs,
>
> The short version:
>
> Should we always use UTF-8 for encoding and decoding URLs, regardless of
> the wiki encoding, for better compliance with web standards?
>
>
> The long version:
>
> By definition, URLs can only contain ASCII characters; every other
> character must be converted to its corresponding bytes and escaped as
> %XY. The problem is that "corresponding bytes" implies a
> charset + encoding, and no specification *enforces* a specific pair,
> although it is *recommended* to use Unicode + UTF-8, to comply with the
> modern tendency of the web in general.
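
To make the "corresponding bytes" point concrete, here is a minimal Python sketch (an illustration only, not XWiki code) that escapes the same one-character document name under both encodings, using the standard library's urllib.parse.quote:

```python
from urllib.parse import quote

# Illustrative document name; any non-ASCII character would do.
name = "é"

# The same character maps to different bytes, hence different %XY escapes.
latin1_escaped = quote(name, encoding="iso-8859-1")  # one byte:  %E9
utf8_escaped = quote(name, encoding="utf-8")         # two bytes: %C3%A9

print(latin1_escaped, utf8_escaped)
```

The escapes differ, so the server has to know which encoding the client used before it can recover the original name.
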
>
> Traditionally, XWiki has used the configured global wiki encoding for
> URLs as well, which means that before 1.9, when we switched to UTF-8 as
> the default wiki encoding, all URLs used the ISO-8859-1 encoding. Since
> the switch to UTF-8, URLs also use UTF-8 by default, although the wiki
> encoding can still be changed.
>
> Now, since 2.1, a bugfix accidentally changed the behavior, so that
> parsing back URLs always uses the UTF-8 encoding, even though composing
> URLs continues to use the wiki encoding. This is a bug, which prevents
> changing the encoding to anything other than UTF-8, and it should be
> fixed.
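
The mismatch described above can be reproduced outside XWiki. The following is a rough Python sketch under stated assumptions: WIKI_ENCODING and the three helper functions are hypothetical names standing in for the wiki's configured encoding and its URL composing/parsing code, not actual XWiki APIs.

```python
from urllib.parse import quote, unquote

# Hypothetical stand-in for a wiki configured with a non-UTF-8 encoding.
WIKI_ENCODING = "iso-8859-1"


def compose_url_segment(name: str) -> str:
    """Escape a document name using the wiki encoding (composing side)."""
    return quote(name, encoding=WIKI_ENCODING)


def parse_always_utf8(segment: str) -> str:
    """Decode assuming UTF-8 regardless of the wiki encoding (the reported bug)."""
    return unquote(segment, encoding="utf-8", errors="replace")


def parse_with_wiki_encoding(segment: str) -> str:
    """Decode with the same encoding that was used for composing (option 1)."""
    return unquote(segment, encoding=WIKI_ENCODING)


segment = compose_url_segment("é")              # "%E9"
mangled = parse_always_utf8(segment)            # lone byte 0xE9 is invalid UTF-8
restored = parse_with_wiki_encoding(segment)    # round-trips correctly

print(repr(mangled), repr(restored))
```

With a mismatched pair the name comes back as a replacement character, which is why any non-UTF-8 wiki encoding breaks; either option in the proposal fixes this by making both sides agree.
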
>
> Now, we have two options:
>
> 1. Reintroduce the old behavior, so that URLs always use the wiki
> encoding. This is a direct bugfix.
> 2. Also change the encoding part, so that UTF-8 is always used. This is
> an improvement, going towards better compliance with web standards.
>
> Personally I think that the second option is the better one, but it
> requires a vote, since it has a few drawbacks.
>
> Advantages:
> + better compliance with web standards, since UTF-8 is the recommended
> encoding for URLs (although not imposed)

Is there any reference for this? Some RFC that we could quote in the
code?

> + support for a wider range of document names, since UTF-8 allows
> full-unicode document names, while ISO-8859-1 limits names to latin1
> characters
> + better support from browsers, since entering accented characters
> directly in the address bar encodes the URL sent to the server using
> UTF-8,

Is that true for all browsers? Is there a standard?

> and decoding the URL also assumes UTF-8; this means that a
> document named "é" will be printed as .../view/Main/%E9 and will have to
> be entered the same way in the address bar when ISO-8859-1 is used, and
> as .../view/Main/é when UTF-8 is used
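
As a small illustration of the "wider range of document names" advantage, here is a hedged Python sketch (the helper try_escape and the sample name are made up for this example). It shows that a name outside Latin-1 cannot be escaped at all under ISO-8859-1, while UTF-8 handles it and round-trips:

```python
from typing import Optional
from urllib.parse import quote, unquote


def try_escape(name: str, encoding: str) -> Optional[str]:
    """Return the %XY-escaped name, or None if the encoding cannot represent it."""
    try:
        return quote(name, encoding=encoding)
    except UnicodeEncodeError:
        return None


# Hypothetical document name using characters outside Latin-1.
name = "文档"

latin1_result = try_escape(name, "iso-8859-1")  # not representable in Latin-1
utf8_result = try_escape(name, "utf-8")         # escaped as UTF-8 bytes

print(latin1_result, utf8_result)
# Decoding with the same encoding recovers the original name.
print(unquote(utf8_result, encoding="utf-8") == name)
```
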
>
> Drawbacks:
> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
> Tomcat configuration will have to be changed as in
> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat

Any known issue logged against Tomcat? Any planned fix? If this is
really a web standard, why wouldn't Tomcat change this behavior?

> - some existing bookmarks will not work anymore once the encoding is
> changed
>
> +1 for option 2 from me,

Before voting, I'd like to see how "standard" this is. If it's really
standard and we have "official" standard docs to back this, then I
agree that we should choose 2.

Thanks
-Vincent
