> Second phase:
> - ASCII % encoding would be removed from the url implementation(s), and pushed out 
>to the protocols who need it. Callers expecting the encoding would also need to be 
>repaired to handle the new UTF8 format.
> 

There are cases where a %-escaped URI cannot be unescaped on the client side.
I have some examples here (search results from different engines).

I searched for "baseball" in Japanese, which takes two characters.
In typical Japanese charsets, they are represented as 4 bytes (2 bytes
per character).

1)
http://search.yahoo.co.jp/bin/search?p=%CC%EE%B5%E5

2)
http://search.netscape.com/ja/search.tmpl?charset=x-sjis&cp=nsiwidsrc&
cat=World/Japanese&search=%96%EC%8B%85

3)
http://www.google.com/search?q=%96%EC%8B%85&btnG=Google+%8C%9F%8D%F5&hl=ja&lr=

* In the first example, the charset is "EUC-JP", but you cannot really tell
the charset just by looking at the URI.
* In the second, it is "x-sjis" (an alias of "Shift_JIS"), which appears in
the query part, but that part is supposed to be parsed by the server.
* In the third case, it is "Shift_JIS" (the same charset as the second case),
but again the client has no way to know. There is also an additional
escaped string "%8C%9F%8D%F5" which I have no idea what it is (it
could be binary data instead of text).
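To illustrate the ambiguity described above, here is a small sketch (in Python, not part of the original mail) showing that the same escaped query decodes correctly only if the client already knows the charset; the escape sequences themselves carry no charset information:

```python
# Sketch: the %-escaped bytes from the example URLs above decode to the
# same two characters ("baseball" in Japanese), but only when the right
# charset is chosen; there is nothing in the URI that tells you which.
from urllib.parse import unquote_to_bytes

euc_jp_bytes = unquote_to_bytes("%CC%EE%B5%E5")     # from the Yahoo Japan URL
shift_jis_bytes = unquote_to_bytes("%96%EC%8B%85")  # from the Google URL

# With the correct charset, both yield the same text.
print(euc_jp_bytes.decode("euc_jp"))        # the two-character query
print(shift_jis_bytes.decode("shift_jis"))  # same two characters

# Guessing the wrong charset corrupts the text or fails outright.
try:
    shift_jis_bytes.decode("euc_jp")
except UnicodeDecodeError as err:
    print("wrong charset guess:", err)
```

Since the client cannot distinguish these cases by inspection, unescaping blindly would mangle such URIs.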

So the client cannot always unescape a URI when the URI was already escaped
by the server or placed in a document in escaped form (e.g. in an HREF
attribute). I therefore think we need exception cases that allow the
%-escaped representation in necko.

Naoki

