Re: unicode URIs

Naoki Hotta Wed, 16 May 2001 10:49:39 -0700
The proposal being discussed here is about the internal format of URIs in 
Mozilla. It is not supposed to affect what Mozilla sends out to servers.
My point was about cases where a URI is already escaped and cannot 
be safely converted to UTF-8, so Mozilla needs to keep the URI in its 
escaped form for those cases.
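
As a rough illustration of the kind of exception I mean, here is a minimal
Python sketch (not Mozilla code). The helper name unescape_if_utf8 is
hypothetical, and the rule it applies (unescape only when the bytes are
valid UTF-8, otherwise keep the %-escaped form) is just one possible policy.
The escaped strings are taken from the search URIs quoted below.

```python
from urllib.parse import unquote_to_bytes


def unescape_if_utf8(escaped: str) -> str:
    """Hypothetical helper: unescape only when the bytes are valid UTF-8.

    If the %-escaped bytes do not form valid UTF-8, the client cannot know
    which charset they are in, so the escaped form is returned unchanged.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return escaped


euc_query = "%CC%EE%B5%E5"    # query value from the EUC-JP search URI
sjis_query = "%96%EC%8B%85"   # query value from the Shift_JIS search URIs

# Neither byte sequence is valid UTF-8, so both stay escaped.
print(unescape_if_utf8(euc_query))
print(unescape_if_utf8(sjis_query))

# Only with out-of-band knowledge of the charset can the text be recovered;
# both strings then decode to the same two characters.
same_text = (
    unquote_to_bytes(euc_query).decode("euc_jp")
    == unquote_to_bytes(sjis_query).decode("shift_jis")
)
print(same_text)

# The same two characters escaped as UTF-8 can be unescaped safely.
print(unescape_if_utf8("%E9%87%8E%E7%90%83"))
```

Note that passing the UTF-8 check is only a heuristic: bytes in a legacy
charset can happen to form valid UTF-8, so this sketch shows why an escaped
fallback is needed rather than how to detect the charset reliably.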

Naoki

Paul Deuter wrote:

> Even though I am not a Mozilla user, I have been reading the discussion of
> Unicode URIs with great interest.  
> I think that Naoki is exactly correct: the %HH format is already extensively
> used and is context sensitive.  The character encoding is an agreement
> between the sender and receiver.  The encoding is not always (indeed rarely)
> UTF-8.
> 
> Rather I believe there is a need for a new encoding format explicitly for
> Unicode.  I like the %uHHHH format because it is already in use by many user
> agents and already correctly decoded by some servers.  But whatever format
> is chosen, I would just like to see something that says explicitly "I am a
> Unicode codepoint".  I don't believe that the %HH format can be used as this
> explicit Unicode format, because the %HH is already used by lots of software
> to specify other character sets (see Naoki's examples below).
> 
> -Paul Deuter
> Plumtree Software
> 
> -----Original Message-----
> From: Naoki Hotta [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 15, 2001 2:22 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: unicode URIs
> 
> 
> 
>> Second phase:
>> - ASCII % encoding would be removed from the url implementation(s), and
>> pushed out to the protocols who need it. Callers expecting the encoding
>> would also need to be repaired to handle the new UTF8 format.
> 
> 
> There are cases where a %-escaped URI cannot be unescaped on the client side.
> I have examples here (search results from different engines).
> 
> I searched for "baseball" in Japanese, which takes two characters.
> In typical Japanese charsets, they are represented as 4 bytes (2 bytes
> per character).
> 
> 1)
> http://search.yahoo.co.jp/bin/search?p=%CC%EE%B5%E5
> 
> 2)
> http://search.netscape.com/ja/search.tmpl?charset=x-sjis&cp=nsiwidsrc&cat=World/Japanese&search=%96%EC%8B%85
> 
> 3)
> http://www.google.com/search?q=%96%EC%8B%85&btnG=Google+%8C%9F%8D%F5&hl=ja&lr=
> 
> * In the first example, the charset is "EUC-JP", but you cannot tell 
> the charset just by looking at the URI.
> * In the second, it is "x-sjis" (an alias of "Shift_JIS"). The charset is 
> named in the query part, but that is supposed to be parsed by the server.
> * In the third case, it is "Shift_JIS" (the same charset as the second case),
> but again the client has no way to know. There is also an additional
> escaped string, "%8C%9F%8D%F5", and I have no idea what it is (it
> could be binary data instead of text).
> 
> So the client cannot always unescape a URI when the URI was already escaped 
> by the server or placed in a document in escaped form (e.g. in "HREF=").
> I think we need exception cases that allow a %-escaped representation in 
> necko.
> 
> Naoki