Hi devs,

The short version:

Should we always use UTF-8 for encoding and decoding URLs, regardless of 
the wiki encoding, for better compliance with web standards?


The long version:

By definition, URLs can only contain ASCII characters; everything else 
must be converted to its corresponding bytes and escaped as %XY 
escapes. The problem is that "its corresponding bytes" implies a 
charset + encoding, and no specification *enforces* a specific pair, 
although it is *recommended* to use Unicode + UTF-8, in line with the 
general direction of the modern web.
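
To make the ambiguity concrete, here is a minimal sketch using Python's standard urllib.parse (not XWiki code; the document name "é" is just an illustrative choice). The same character produces different %XY escapes depending on the encoding chosen:

```python
from urllib.parse import quote

name = "é"  # hypothetical document name, for illustration only

# The same character maps to different bytes, hence different escapes.
latin1_escaped = quote(name, encoding="iso-8859-1")  # one byte: 0xE9
utf8_escaped = quote(name, encoding="utf-8")         # two bytes: 0xC3 0xA9

print(latin1_escaped)  # %E9
print(utf8_escaped)    # %C3%A9
```

Nothing in the escaped URL itself says which encoding was used, so both ends have to agree on it out of band.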

Traditionally, XWiki has used the configured global wiki encoding for 
URLs as well, which means that before 1.9, when we switched to UTF-8 
as the default wiki encoding, all URLs were using the ISO-8859-1 
encoding. Since the switch to UTF-8, URLs are also using the UTF-8 
encoding by default, although the wiki encoding can be changed.

Since 2.1, however, a bugfix accidentally changed the behavior, so that 
parsing back URLs always uses the UTF-8 encoding, even though composing 
URLs continues to use the wiki encoding. This is a bug, which prevents 
changing the encoding to anything other than UTF-8, and it should be fixed.
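
The mismatch can be sketched roughly as follows. This is a Python illustration of the behavior described above, not the actual XWiki code; WIKI_ENCODING and the function names are hypothetical, and it assumes a wiki configured for ISO-8859-1:

```python
from urllib.parse import quote, unquote

WIKI_ENCODING = "iso-8859-1"  # assumed non-UTF-8 wiki encoding


def compose(name):
    """Compose the escaped URL segment using the wiki encoding."""
    return quote(name, encoding=WIKI_ENCODING)


def parse_always_utf8(escaped):
    """Parse the segment as UTF-8 regardless of the wiki encoding."""
    return unquote(escaped, encoding="utf-8")


def parse_wiki_encoding(escaped):
    """Parse the segment with the same encoding used to compose it."""
    return unquote(escaped, encoding=WIKI_ENCODING)


escaped = compose("é")
roundtrip_mismatched = parse_always_utf8(escaped)
roundtrip_consistent = parse_wiki_encoding(escaped)

print(escaped)                      # %E9
print(repr(roundtrip_mismatched))   # '\ufffd' (replacement character)
print(repr(roundtrip_consistent))   # 'é'
```

Either option below restores the round trip; they only differ in which encoding both sides agree on.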

Now, we have two options:

1. Reintroduce the old behavior, so that URLs always use the wiki 
encoding. This is a direct bugfix.
2. Also change the encoding part, so that UTF-8 is always used. This is 
an improvement, going towards better compliance with web standards.

Personally I think that the second option is the better one, but it 
requires a vote, since it has a few drawbacks.

Advantages:
+ better compliance with web standards, since UTF-8 is the recommended 
encoding for URLs (although not imposed)
+ support for a wider range of document names, since UTF-8 allows 
full-unicode document names, while ISO-8859-1 limits names to latin1 
characters
+ better support from browsers, since accented characters entered 
directly in the address bar are sent to the server encoded as UTF-8, 
and browsers also assume UTF-8 when decoding a URL for display; this 
means that with ISO-8859-1 a document named "é" is displayed as 
.../view/Main/%E9 and has to be entered the same way in the address 
bar, while with UTF-8 it is displayed and entered as .../view/Main/é
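
As a rough illustration of the "wider range of document names" point, the following Python sketch (standard library only; the helper name and sample names are hypothetical) checks which names can be escaped at all under each encoding:

```python
from urllib.parse import quote


def can_escape(name, encoding):
    """Return True if `name` can be percent-encoded with `encoding`."""
    try:
        quote(name, encoding=encoding)  # quote() encodes strictly by default
        return True
    except UnicodeEncodeError:
        return False


latin_name = "é"     # representable in both encodings
cjk_name = "日本語"   # outside the latin1 range

print(can_escape(latin_name, "iso-8859-1"))  # True
print(can_escape(cjk_name, "iso-8859-1"))    # False
print(can_escape(cjk_name, "utf-8"))         # True
```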

Drawbacks:
- by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the 
Tomcat configuration will have to be changed as in 
http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
- some existing bookmarks will not work anymore once the encoding is changed

+1 for option 2 from me,
-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/