I wholeheartedly agree with using pure UTF-8 internally and only
escaping the data before it goes out onto the wire.
When escaping the UTF-8, follow the usual rules for 7-bit ASCII and
additionally escape every byte between 0x80 and 0xFF. This works well
with Perl 5.6 or later using the CGI module along with the "use utf8;"
pragma.
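As a rough sketch of that escaping rule (in Python rather than Perl, purely for illustration; the unreserved set below follows RFC 2396 and is my assumption, not something specified in this thread):

```python
# Sketch: percent-escape a Unicode string as raw UTF-8 bytes.
# Unreserved ASCII characters pass through untouched; every other
# byte -- including all of 0x80-0xFF -- becomes %HH.

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.!~*'()"
)

def escape_utf8(text: str) -> str:
    out = []
    for byte in text.encode("utf-8"):   # escape the UTF-8 bytes, not codepoints
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append("%%%02X" % byte)
    return "".join(out)

print(escape_utf8("abc"))   # abc
print(escape_utf8("a b"))   # a%20b
```

The point of escaping the UTF-8 *bytes* is that the receiver can unescape and get the original byte sequence back without any charset negotiation.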
Please avoid the %uHHHH notation for escaping, as it is not defined in
any standard (unless you can show me an RFC or STD document for it).
Support for it is a mixed bag out there.
Please do not normalize, case-fold, or do any case conversion on the
UTF-8. MSIE does this, treating 0x80 through 0xFF as if they were
Latin-1, and that causes problems.
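To see why that case conversion is destructive, here is a small Python sketch of the failure mode (my illustration, not taken from the thread):

```python
# "é" is U+00E9; its UTF-8 encoding is the two bytes C3 A9.
raw = "é".encode("utf-8")
print(raw)  # b'\xc3\xa9'

# Misinterpret those bytes as Latin-1 (as described for MSIE above)
# and lowercase them: 0xC3 ("Ã") lowercases to 0xE3 ("ã").
mangled = raw.decode("latin-1").lower().encode("latin-1")
print(mangled)  # b'\xe3\xa9'

# The result is no longer valid UTF-8: 0xE3 opens a 3-byte sequence
# that is never completed, so the original character is unrecoverable.
try:
    mangled.decode("utf-8")
except UnicodeDecodeError:
    print("case conversion as Latin-1 corrupted the UTF-8")
```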
Naoki Hotta wrote:
> The proposal being discussed here is about the internal format of URIs in
> mozilla. It is not supposed to affect what mozilla sends out to servers.
> The point I mentioned was about the cases where an escaped URI cannot
> be safely converted to UTF-8, so mozilla needs to keep the URI escaped
> in those cases.
>
> Naoki
>
> Paul Deuter wrote:
>
>> Even though I am not a Mozilla user, I have been reading the
>> discussion of Unicode URIs with great interest. I think that Naoki
>> is exactly correct: the %HH format is already extensively used and
>> is context-sensitive. The character encoding is an agreement between
>> the sender and receiver. The encoding is not always UTF-8 (indeed,
>> it rarely is).
>>
>> Rather, I believe there is a need for a new encoding format explicitly
>> for Unicode. I like the %uHHHH format because it is already in use by
>> many user agents and already correctly decoded by some servers. But
>> whatever format is chosen, I would just like to see something that
>> says explicitly "I am a Unicode codepoint". I don't believe that the
>> %HH format can be used as this explicit Unicode format, because %HH
>> is already used by lots of software to specify other character sets
>> (see Naoki's examples below).
>>
>> -Paul Deuter
>> Plumtree Software
>>
>> -----Original Message-----
>> From: Naoki Hotta [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, May 15, 2001 2:22 PM
>> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>> Subject: Re: unicode URIs
>>
>>
>>
>>> Second phase:
>>> - ASCII % encoding would be removed from the url implementation(s), and
>>> pushed out to the protocols who need it. Callers expecting the encoding
>>> would also need to be repaired to handle the new UTF8 format.
>>
>>
>> There are cases where a %-escaped URI cannot be unescaped on the
>> client side. I have examples here (search results from different
>> engines).
>>
>> I searched for "baseball" in Japanese, which takes two characters.
>> In typical Japanese charsets, they are represented as 4 bytes
>> (2 bytes per character).
>>
>> 1)
>> http://search.yahoo.co.jp/bin/search?p=%CC%EE%B5%E5
>>
>> 2)
>> http://search.netscape.com/ja/search.tmpl?charset=x-sjis&cp=nsiwidsrc&cat=World/Japanese&search=%96%EC%8B%85
>>
>> 3)
>> http://www.google.com/search?q=%96%EC%8B%85&btnG=Google+%8C%9F%8D%F5&hl=ja&lr=
>>
>> * In the first example, the charset is "EUC-JP", but you cannot tell
>> the charset just by looking at the URI.
>> * The second one is "x-sjis" (an alias of "Shift_JIS"), which appears
>> in the query part, but that is supposed to be parsed by the server.
>> * In the third case, it's "Shift_JIS" (the same charset as the second
>> case), but again the client has no way to know. Also, there is an
>> additional escaped string "%8C%9F%8D%F5" which I have no idea what it
>> is (it could be binary data rather than text).
>>
>> So the client cannot always unescape a URI when the URI has already
>> been escaped by the server or placed in a document in escaped form
>> (e.g. in an "HREF="). So I think we need exception cases to allow
>> the %-escaped representation in necko.
>>
>> Naoki
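Naoki's point can be demonstrated concretely with the query strings from his own examples above (a Python sketch; the decoded word, 野球 "baseball", matches his description). The same word yields entirely different escaped bytes depending on the charset, and nothing in the URI itself says which charset was used:

```python
from urllib.parse import unquote_to_bytes

# Query values from Naoki's examples 1 (Yahoo) and 2/3 (Netscape/Google):
yahoo_bytes  = unquote_to_bytes("%CC%EE%B5%E5")   # EUC-JP bytes
google_bytes = unquote_to_bytes("%96%EC%8B%85")   # Shift_JIS bytes

# Both decode to the same two characters -- but only if you already
# know the charset; the URI carries no label saying which one applies.
print(yahoo_bytes.decode("euc-jp"))      # 野球
print(google_bytes.decode("shift_jis"))  # 野球

# Decoding with the wrong charset yields garbage or an error, which is
# why the client cannot blindly unescape and convert such URIs to UTF-8.
```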