Again quoting the RFC: >>>
For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.
It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.
<<<
I guess the http scheme sticks to US-ASCII for now. But maybe with escapes you could access pages on some web servers, like http://aku.suomi.fi/k%E4%E4k.html = http://aku.suomi.fi/kääk.html. To be honest, I don't know.
Also, I don't know if the "systematic treatment" has already happened or when it will happen.
So it is up to us to decide how we deal with charsets.
Since VFS is written in Java, it would make sense to first turn the character sequence into 16-bit Unicode (UTF-16?) and then encode every character above US-ASCII (7-bit) or ISO-LATIN-1 (8-bit).
But this would not make the visual representation of the URI
very nice. According to the URI spec, one should be able to read a URI
out on the radio :-) If you are in Japan, every character would be encoded
and very difficult for the announcer to read.
But if you don't encode, then to a westerner that URI would look like
a sequence of those boxes that represent characters for which there is no font.
Let's get practical.
Someone wrote the following URI in an Ant build file (and some Ant task
uses VFS):
webdav:/höh/kääk.ini
Ant, when reading the string, knows that it is encoded in ISO-LATIN-1,
but the string in the JVM is in Unicode.
Ant gives this string (URI) to VFS, which encodes all characters above US-ASCII,
so it is now
webdav:/h%F6h/k%E4%E4k.ini
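The encoding step above can be sketched in Java. This is just my guess at how it might work, assuming every character fits in ISO-LATIN-1; the class and method names are made up, not the actual VFS code:

```java
// Sketch: percent-escape every character above US-ASCII, assuming all
// characters fit in ISO-LATIN-1 (one byte per character). Hypothetical
// code, not the real VFS implementation.
public class UriEscapeSketch {
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                sb.append(c);                                          // US-ASCII passes through
            } else if (c <= 0xFF) {
                sb.append('%').append(String.format("%02X", (int) c)); // ISO-LATIN-1 byte value
            } else {
                // Above ISO-LATIN-1 a single-byte escape no longer works.
                throw new IllegalArgumentException(
                        "no single-byte escape for U+" + String.format("%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00F6 = ö, \u00E4 = ä
        System.out.println(escapeNonAscii("webdav:/h\u00F6h/k\u00E4\u00E4k.ini"));
        // prints webdav:/h%F6h/k%E4%E4k.ini
    }
}
```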
Now the webdav provider makes an HTTP request, let's say to Tomcat.
A question arises:
can Tomcat (or the WebDAV protocol spec) handle Unicode characters
in resource names?
I don't know.
But maybe the webdav provider implementor knows.
So if WebDAV names can only handle US-ASCII, then the provider
can say right away, when it is asked to canonicalize the
URI, that this is not a proper WebDAV URI.
Or maybe this is not specified, and some WebDAV servers could handle the URI while others could not. Maybe the webdav provider could then ask the server what it supports. But maybe there is no single standardized way to ask this.
At this point a sane person starts to give up and thinks: "Whatever!" Just pass the string and let the user handle the errors.
But let's say that WebDAV can handle ISO-LATIN-1 and the request is sent to the server.
The server's filesystem is encoded in some other coding (EBCDIC?) that maps ä and ö to different numbers. So in order to do the mapping, the WebDAV server would need to know what character encoding VFS uses (UTF-16).
But since this is not specified (at least in the RFC I am quoting), it would probably unescape using its own encoding and request a wrong resource from its filesystem.
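Here is a small illustration of why the server cannot unescape reliably without knowing the sender's encoding: the same character produces different escape sequences depending on which charset turned it into bytes (only charsets every JVM is guaranteed to support are used here):

```java
import java.nio.charset.StandardCharsets;

// Sketch: the character 'ä' escapes differently depending on the charset
// used to produce the bytes, so a server that unescapes with the wrong
// charset ends up looking for the wrong resource.
public class EscapeAmbiguity {
    static String percentEscape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append('%').append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u00E4"; // ä
        System.out.println(percentEscape(a.getBytes(StandardCharsets.ISO_8859_1))); // %E4
        System.out.println(percentEscape(a.getBytes(StandardCharsets.UTF_8)));      // %C3%A4
    }
}
```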
This state of affairs makes me wonder: do the standards makers really want to make standards, or do they just pretend? The answer, of course, is that industry only wants to make standards up to a point, because confusion and protectionism make the IT business thrive.
That being said, I think one pragmatic approach could be to treat URI characters as coming from the Unicode character set. When transported, they would be in US-ASCII, with everything above US-ASCII escaped.
So, to answer your question: û = %FB
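A quick sanity check of that mapping under the ISO-LATIN-1 assumption: û is U+00FB, so the single-byte escape is %FB (under UTF-8 it would instead come out as %C3%BB):

```java
// û is code point U+00FB, and in ISO-LATIN-1 the byte value equals the
// code point, so the escape is just the code point in hex.
public class FbCheck {
    public static void main(String[] args) {
        char c = '\u00FB'; // û
        System.out.println("%" + String.format("%02X", (int) c)); // prints %FB
    }
}
```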
But all this is just assuming and making things up. I guess the decision is in your hands since you write the code.
- rami
