On 12/26/2009 02:29 PM, Vincent Massol wrote:
>
> On Dec 26, 2009, at 2:17 PM, Sergiu Dumitriu wrote:
>
>> Hi devs,
>>
>> The short version:
>>
>> Should we always use UTF-8 for encoding and decoding URLs, regardless
>> of the wiki encoding, for better compliance with web standards?
>>
>>
>> The long version:
>>
>> By definition, URLs can only contain ASCII characters; every other
>> character must be converted to its corresponding bytes and written as
>> %XY escapes. The problem is that "its corresponding bytes" implies a
>> charset + encoding, and no specification *enforces* a specific pair,
>> although it is *recommended* to use Unicode + UTF-8, to comply with
>> the modern tendency of the web in general.
>>
>> Traditionally, XWiki has been using the same encoding as the
>> configured global wiki encoding for the URLs, which means that before
>> 1.9, when we switched to UTF-8 as the default wiki encoding, all URLs
>> were using the ISO-8859-1 encoding. Since the switch to UTF-8, URLs
>> are also using the UTF-8 encoding by default, although the wiki
>> encoding can be changed.
>>
>> Now, since 2.1, a bugfix accidentally changed the behavior, so that
>> parsing back URLs always uses the UTF-8 encoding, even though
>> composing URLs continues to use the wiki encoding. This is a bug,
>> which prevents changing the encoding to anything other than UTF-8,
>> and it should be fixed.
>>
>> Now, we have two options:
>>
>> 1. Reintroduce the old behavior, so that URLs always use the wiki
>> encoding. This is a direct bugfix.
>> 2. Also change the encoding part, so that UTF-8 is always used. This
>> is an improvement, going towards better compliance with web standards.
>>
>> Personally I think that the second option is the better one, but it
>> requires a vote, since it has a few drawbacks.
>>
>> Advantages:
>> + better compliance with web standards, since UTF-8 is the recommended
>> encoding for URLs (although not imposed)
>
> Is there any reference to this? Some RFC that we could quote in the
> code?

The URL RFC predates the wide adoption of UTF-8, so it does not mention
any encoding (see http://tools.ietf.org/html/rfc1738#section-2.2 ). This
is why I said that there's no enforcement. However, the URI RFC, which
is a generalization of URLs, enforces UTF-8 (see
http://tools.ietf.org/html/rfc3986#section-2.5 ).

The URI RFC officially *updates* the URL RFC, so we can say that URLs 
are currently standardized by the new, UTF-8 enforcing RFC 3986, 
although for backwards compatibility URLs can still be used with the RFC 
1738 definition.

>> + support for a wider range of document names, since UTF-8 allows
>> full Unicode document names, while ISO-8859-1 limits names to Latin-1
>> characters
>> + better support from browsers, since entering accented characters
>> directly in the address bar encodes the URL sent to the server using
>> UTF-8,
>
> Is that true for all browsers? Is there a standard?

Firefox, Opera and Chrome all do this.

Konqueror even displays %E9 as an invalid UTF-8 character <?>, so it
assumes even more strongly that the URL is in UTF-8.

IE6 does not automatically convert %XY escapes to their equivalent
character, so it displays both %E9 and %C3%A9 (the UTF-8 encoding of é)
as they are. However, entering é in the address bar sends it as UTF-8
bytes. Also note that IE6 predates RFC 3986 by several years, so it has
the right not to assume UTF-8 in URLs.
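
To make the %E9 vs. %C3%A9 difference concrete, here is a small sketch.
It uses Python's standard urllib purely as an illustration (this is not
XWiki code, and the variable names are made up for the example):

```python
from urllib.parse import quote, unquote

name = "é"

# Percent-encode the same character under the two charsets discussed here.
latin1_url = quote(name, encoding="iso-8859-1")
utf8_url = quote(name, encoding="utf-8")

print(latin1_url)  # %E9
print(utf8_url)    # %C3%A9

# A lone 0xE9 byte is not valid UTF-8. With its default error handling,
# unquote() substitutes the replacement character U+FFFD, which is
# similar to what Konqueror shows as <?>.
decoded_as_utf8 = unquote(latin1_url, encoding="utf-8")
print(decoded_as_utf8 == "\ufffd")  # True
```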

>> and decoding the URL also assumes UTF-8; this means that a
>> document named "é" will be printed as .../view/Main/%E9 and will
>> have to be entered the same way in the address bar when ISO-8859-1
>> is used, and as .../view/Main/é when UTF-8 is used
>>
>> Drawbacks:
>> - by default Tomcat uses ISO-8859-1 as the encoding for URLs, so the
>> Tomcat configuration will have to be changed as in
>> http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Encoding#HTomcat
>
> Any known issue logged against Tomcat? Any planned fix? If this is
> really a web standard why wouldn't Tomcat change this behavior?

According to http://wiki.apache.org/tomcat/Tomcat/UTF-8 , Tomcat
developers interpret RFC 2616 (the HTTP 1.1 specification) as
recommending ISO-8859-1 as the default charset. However, this is not
true. The RFC references the URI RFC as the standard for the requested
address, and ISO-8859-1 is the default for the response *body* only. The
original URI RFC referenced by the HTTP RFC,
http://www.ietf.org/rfc/rfc2396.txt , does not recommend a default
charset + encoding; it mentions UTF-8 as a possible encoding and doesn't
mention ISO-8859-1 at all. So I'd rather say that the HTTP 1.1
specification does not recommend any default encoding for addresses,
and if anything it hints towards UTF-8 rather than ISO-8859-1. Moreover,
RFC 2396 has been obsoleted by RFC 3986, which clearly states that UTF-8
is used. Perhaps I should send this interpretation to the Tomcat guys.
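
As a rough illustration of why the container's URL decoding matters,
the sketch below (plain Python, not Tomcat or XWiki code; the path is
only an example) decodes the same UTF-8 request path under both
charsets:

```python
from urllib.parse import quote, unquote

# What a browser sends when "é" is typed in the address bar (UTF-8 bytes).
sent = "/view/Main/" + quote("é", encoding="utf-8")

# A container that decodes the path as ISO-8859-1 sees two characters
# instead of one, so the document lookup uses the wrong name.
seen_latin1 = unquote(sent, encoding="iso-8859-1")

# A container configured for UTF-8 recovers the original name.
seen_utf8 = unquote(sent, encoding="utf-8")

print(sent)         # /view/Main/%C3%A9
print(seen_latin1)  # /view/Main/Ã©
print(seen_utf8)    # /view/Main/é
```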

>> - some existing bookmarks will not work anymore once the encoding is
>> changed
>>
>> +1 for option 2 from me,
>
> Before voting I'd like to see how "standard" this is. If it's really
> standard and we have "official" standard docs to back this then I
> agree that we should choose 2.

IMO, RFC 3986 is the current standard and the one we should follow, and 
it does specify UTF-8 as the ONLY encoding, not just the default.
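
For reference, the 2.1 regression and the two options from the proposal
can be sketched as a round trip. The helpers compose_url and parse_url
are hypothetical names invented for this example and do not correspond
to actual XWiki methods:

```python
from urllib.parse import quote, unquote


def compose_url(doc_name, encoding):
    """Hypothetical helper: build a view URL for a document name."""
    return "/view/Main/" + quote(doc_name, encoding=encoding)


def parse_url(url, encoding):
    """Hypothetical helper: recover the document name from a view URL."""
    return unquote(url.rsplit("/", 1)[-1], encoding=encoding)


wiki_encoding = "iso-8859-1"  # assumed non-default wiki encoding

# Behavior since 2.1: compose with the wiki encoding, parse with UTF-8.
# The name does not survive the round trip.
broken = parse_url(compose_url("é", wiki_encoding), "utf-8")

# Option 1: both directions use the wiki encoding.
option1 = parse_url(compose_url("é", wiki_encoding), wiki_encoding)

# Option 2: both directions always use UTF-8.
option2 = parse_url(compose_url("é", "utf-8"), "utf-8")

print(broken == "é", option1 == "é", option2 == "é")  # False True True
```

Both options restore the round trip; they differ only in which bytes
end up in the URL.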

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
