Re: unicode URIs

Chak Nanga Thu, 10 May 2001 11:58:20 -0700

Correction: The first line should read:

Also, please note that URLs in hrefs which have non-ascii chars are not currently being UTF-8 encoded (while the URLs typed in the UrlBar are)

Chak Nanga wrote:

Also, please note that URLs in hrefs which have non-ascii chars are not currently being UTF-8 encoded (like their counterparts typed in the UrlBar)
To test this:
Please open the attached test page in Mozilla, mouse over the "Click Me" link, and watch the status bar: the URL will be displayed as http://www.dmoz.org/World/Fran%e7ais/ where "%e7" is the escaped value for 'ç' in http://www.dmoz.org/World/Français/
Clicking on the link will take you to the correct page.
Now, type http://www.dmoz.org/World/Français/ in the UrlBar and the page will not be found, because the URL gets UTF-8 encoded before it is sent.
Thanks
Chak
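
To make the difference above concrete, here is a minimal Python sketch (not Mozilla code) that escapes the same path two ways. It assumes the "%e7" form comes from escaping the single ISO-8859-1 byte for 'ç', and that the UrlBar form comes from escaping the two UTF-8 bytes; the variable names are purely illustrative.

```python
from urllib.parse import quote, unquote

path = "/World/Français/"

# Escape the raw Latin-1 byte for 'ç' (0xE7): what the href case shows.
latin1_escaped = quote(path, encoding="latin-1")

# Escape the UTF-8 bytes for 'ç' (0xC3 0xA7): what the UrlBar case sends.
utf8_escaped = quote(path, encoding="utf-8")

print(latin1_escaped)  # /World/Fran%E7ais/
print(utf8_escaped)    # /World/Fran%C3%A7ais/

# Each form only round-trips when decoded with the encoding it was built from.
print(unquote(latin1_escaped, encoding="latin-1") == path)  # True
print(unquote(utf8_escaped, encoding="utf-8") == path)      # True
```

A server that stores the path as Latin-1 bytes would match the first form but not the second, which is consistent with the "page not found" behaviour described above.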

Judson Valeski wrote:
I've been hearing rumblings from various folks about URI encodings.
Within the last week people have suggested everything from making *all* URIs
totally flat ASCII (char*, %escaping, *no* char encoding) to making them
unicode. I've put a 1 hour meeting together tomorrow (Friday, May 11th)
at 1pm Pacific time to discuss the issues and the model we want to support.
If you're at Netscape, go to the Quincy conf. room. Otherwise, dial in
using the following (and yes, I've officially changed my name to "Mr.
Judson Valeski" :-)):
USA Toll Free Number: 888-282-0360
PASSCODE: 47954
LEADER: Mr. Judson Valeski
Currently we're using the old/traditional way to represent URIs, which is to %-escape a set of characters defined in the URL spec. That doesn't cover unicode or UTF8 encoding. This issue is being raised because we have existing bugs that are forcing it to the foreground.
I see four layers here.
1. UI layer. It's possible for me to type unicode into a URL bar, and it's
possible that I'm viewing unicode content in the browser window that has
unicode links in there that, when I hover over them, I want to have them
display as unicode (not encoded or escaped).
2. Loading layer. This is the uriloader/top-level-necko/docshell layer that
takes strings from the UI level, and hands them off to protocol handlers.
3. Protocol handling layer. Some protocols want to play w/ Unicode (UTF8 most
likely) and some don't (HTTP for example).
4. DNS layer. IDNS is a proposed standard that allows for UTF8 (right frank?)
hostnames.
5(?). the IP transport layer. I'm probably erroneously ignoring this level.
If I'm reading everyone's needs correctly here, we need to hash out what each
layer needs to do to support, at least, UTF8 (a unicode encoding) URLs. From
10k feet, it seems that we can tinker w/ interfaces, and just say it's up to a
protocol impl to determine whether or not they can handle the non ASCII data.
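
As a rough illustration of the "leave it up to the protocol impl" idea, the Python sketch below has a loading layer ask each handler whether it accepts non-ASCII data and fall back to UTF-8 %-escaping when it does not. Everything here is hypothetical: the handler classes, the `x-unicode` scheme, and the `hand_off` function are invented for the example and are not Necko interfaces. Hostnames are not treated specially; the DNS layer is out of scope for this sketch.

```python
from urllib.parse import quote

# ASCII characters that are left untouched (URI reserved set plus '%',
# so existing %-escapes are not escaped a second time).
KEEP = ":/?#[]@!$&'()*+,;=%"


class AsciiOnlyHandler:
    """Hypothetical handler for a protocol that wants flat ASCII (e.g. HTTP)."""

    def accepts_non_ascii(self) -> bool:
        return False


class UnicodeHandler:
    """Hypothetical handler for a protocol that is happy with raw Unicode."""

    def accepts_non_ascii(self) -> bool:
        return True


# Hypothetical registry; "x-unicode" is a made-up scheme for illustration.
HANDLERS = {"http": AsciiOnlyHandler(), "x-unicode": UnicodeHandler()}


def hand_off(spec: str) -> str:
    """Return the string the loading layer would pass to the protocol handler."""
    scheme = spec.split(":", 1)[0].lower()
    # Unknown schemes are treated conservatively as ASCII-only.
    handler = HANDLERS.get(scheme, AsciiOnlyHandler())
    if handler.accepts_non_ascii():
        return spec  # the handler deals with the Unicode itself
    # Otherwise %-escape the UTF-8 bytes of every non-ASCII character.
    return quote(spec, safe=KEEP, encoding="utf-8")


print(hand_off("http://www.dmoz.org/World/Français/"))
print(hand_off("x-unicode://example/World/Français/"))
```

The design choice being sketched is only that the escaping decision lives next to the protocol, not in the UI or loading layers; which encoding an ASCII-only protocol should actually receive is exactly the open question in this thread.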
I'd prefer not to spend a lot of time in the meeting talking about fictitious worlds where flat char*'s don't exist and life is represented in unicode. Our master here is reality (*not* RFCs and specs), and we don't want to spend cycles over-planning and disrupting the current code-base to handle some edge case.
Jud
reference:
- nsIURI definition: http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsIURI.idl
- current necko utility uri creation function, http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#81 , notice the UTF8 encoding call.
- bug on that uri creation function for doing the UTF8 encoding http://bugzilla.mozilla.org/show_bug.cgi?id=66515
- new uri scheme proposal http://www.ietf.org/rfc/rfc2718.txt
- uri's ftp://ftp.isi.edu/in-notes/rfc2396.txt
- nice'n'nasty non-ascii chars in the spec http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
- LDAP UTF8 needs ftp://ftp.isi.edu/in-notes/rfc2253.txt ([EMAIL PROTECTED] has a bug against him to support this).
- LDAP url format ftp://ftp.isi.edu/in-notes/rfc2255.txt
- IMAP urls ftp://ftp.isi.edu/in-notes/rfc2192.txt
This is a test....
Click Me
