I wholeheartedly agree with using pure UTF-8 internally and only
escaping the data before it goes out onto the wire.
When escaping the UTF-8, follow the usual rules for 7-bit ASCII and
additionally escape every byte between 0x80 and 0xFF. This works well
with Perl 5.6 or later using the CGI module along with the "use utf8;"
pragma.
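As a rough sketch of that escaping rule (in Python rather than Perl, purely for illustration; the unreserved set below follows RFC 2396 and is my assumption, not something specified in this thread):

```python
# Sketch: percent-escape a Unicode string as raw UTF-8 bytes.
# Unreserved ASCII characters pass through untouched; every other
# byte -- including all of 0x80-0xFF -- becomes %HH.

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.!~*'()"
)

def escape_utf8(text: str) -> str:
    out = []
    for byte in text.encode("utf-8"):   # escape the UTF-8 bytes, not codepoints
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.append("%%%02X" % byte)
    return "".join(out)

print(escape_utf8("abc"))   # abc
print(escape_utf8("a b"))   # a%20b
```

The point of escaping the UTF-8 *bytes* is that the receiver can unescape and get the original byte sequence back without any charset negotiation.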
Please avoid the %uHHHH notation for escaping, as it is not defined in
any standard (unless you can show me an RFC or STD document for it).
Support for it is a mixed bag out there.
Please do not normalize, case-fold, or do any case conversion on the
UTF-8. MSIE does this, treating 0x80 through 0xFF as if they were
Latin-1, and that causes problems.
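To see why that case conversion is destructive, here is a small Python sketch of the failure mode (my illustration, not taken from the thread):

```python
# "é" is U+00E9; its UTF-8 encoding is the two bytes C3 A9.
raw = "é".encode("utf-8")
print(raw)  # b'\xc3\xa9'

# Misinterpret those bytes as Latin-1 (as described for MSIE above)
# and lowercase them: 0xC3 ("Ã") lowercases to 0xE3 ("ã").
mangled = raw.decode("latin-1").lower().encode("latin-1")
print(mangled)  # b'\xe3\xa9'

# The result is no longer valid UTF-8: 0xE3 opens a 3-byte sequence
# that is never completed, so the original character is unrecoverable.
try:
    mangled.decode("utf-8")
except UnicodeDecodeError:
    print("case conversion as Latin-1 corrupted the UTF-8")
```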
Naoki Hotta wrote:
> The proposal being discussed here is about the internal format of URIs in
> mozilla. It is not supposed to affect what mozilla sends out to servers.
> The point I mentioned was about the cases where an escaped URI cannot
> be safely converted to UTF-8, so mozilla needs to keep the URI escaped
> in those cases.
>
> Naoki
>
> Paul Deuter wrote:
>
>> Even though I am not a Mozilla user, I have been reading the
>> discussion of Unicode URIs with great interest. I think that Naoki
>> is exactly correct: the %HH format is already extensively used and
>> is context-sensitive. The character encoding is an agreement between
>> the sender and receiver. The encoding is not always UTF-8 (indeed,
>> it rarely is).
>>
>> Rather, I believe there is a need for a new encoding format explicitly
>> for Unicode. I like the %uHHHH format because it is already in use by
>> many user agents and already correctly decoded by some servers. But
>> whatever format is chosen, I would just like to see something that
>> says explicitly "I am a Unicode codepoint". I don't believe that the
>> %HH format can be used as this explicit Unicode format, because %HH
>> is already used by lots of software to specify other character sets
>> (see Naoki's examples below).
>>
>> -Paul Deuter
>> Plumtree Software
>>
>> -----Original Message-----
>> From: Naoki Hotta [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, May 15, 2001 2:22 PM
>> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>> Subject: Re: unicode URIs
>>
>>
>>
>>> Second phase:
>>> - ASCII % encoding would be removed from the url implementation(s), and
>>> pushed out to the protocols who need it. Callers expecting the encoding
>>> would also need to be repaired to handle the new UTF8 format.
>>
>>
>> There are cases where a %-escaped URI cannot be unescaped on the
>> client side. I have examples here (search results from different
>> engines).
>>
>> I searched for "baseball" in Japanese, which takes two characters.
>> In typical Japanese charsets, they are represented as 4 bytes
>> (2 bytes per character).
>>
>> 1)
>> http://search.yahoo.co.jp/bin/search?p=%CC%EE%B5%E5
>>
>> 2)
>> http://search.netscape.com/ja/search.tmpl?charset=x-sjis&cp=nsiwidsrc&cat=World/Japanese&search=%96%EC%8B%85
>>
>> 3)
>> http://www.google.com/search?q=%96%EC%8B%85&btnG=Google+%8C%9F%8D%F5&hl=ja&lr=
>>
>> * In the first example, the charset is "EUC-JP", but you cannot tell
>> the charset just by looking at the URI.
>> * The second one is "x-sjis" (an alias of "Shift_JIS"), which appears
>> in the query part, but that is supposed to be parsed by the server.
>> * In the third case, it's "Shift_JIS" (the same charset as the second
>> case), but again the client has no way to know. Also, there is an
>> additional escaped string "%8C%9F%8D%F5" which I have no idea what it
>> is (it could be binary data rather than text).
>>
>> So the client cannot always unescape a URI when the URI has already
>> been escaped by the server or placed in a document in escaped form
>> (e.g. in an "HREF="). So I think we need exception cases to allow
>> the %-escaped representation in necko.
>>
>> Naoki
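Naoki's point can be demonstrated concretely with the query strings from his own examples above (a Python sketch; the decoded word, 野球 "baseball", matches his description). The same word yields entirely different escaped bytes depending on the charset, and nothing in the URI itself says which charset was used:

```python
from urllib.parse import unquote_to_bytes

# Query values from Naoki's examples 1 (Yahoo) and 2/3 (Netscape/Google):
yahoo_bytes  = unquote_to_bytes("%CC%EE%B5%E5")   # EUC-JP bytes
google_bytes = unquote_to_bytes("%96%EC%8B%85")   # Shift_JIS bytes

# Both decode to the same two characters -- but only if you already
# know the charset; the URI carries no label saying which one applies.
print(yahoo_bytes.decode("euc-jp"))      # 野球
print(google_bytes.decode("shift_jis"))  # 野球

# Decoding with the wrong charset yields garbage or an error, which is
# why the client cannot blindly unescape and convert such URIs to UTF-8.
```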