On Mar 13, 2010, at 8:20 PM, nitin gupta wrote:
I completely agree with what you and Scott are saying. But I
am not looking to create a URL, just to sanitize one by removing
disallowed characters, i.e. what a browser does when a user
enters a URL. Consider that I parse the following URL
from XML:
http://example.com?test/com
Do you think I should encode the '/' in the query part, i.e.
[test/com]?
Technically, yes, but that's beside the point. Regardless of how
strictly you choose to apply URL encoding, you should be applying it
to specific URL parts, not full URLs.
I don't think we need to (nor will Firefox, if you enter this URL
in the address bar).
You're right that encoding the slash character isn't particularly
important in the query. In a path segment, however, the difference
between encoded and unencoded slashes is very significant:
http://example.com/a/b/c has three path segments (a, b, c), while
http://example.com/a%2fb/c has two (a/b and c). And a slash
definitely shouldn't be encoded where it's used as a delimiter between
URL components. This is actually a good example of why encoding must
be applied to individual URL components, not the full URL.
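
To make the slash example concrete, here is a minimal Python sketch using the standard urllib.parse module. The variable names are mine and purely illustrative; the URLs are the ones from above.

```python
from urllib.parse import quote, unquote, urlsplit

# Data for ONE path segment that happens to contain a slash.
# safe="" tells quote() to escape "/" instead of leaving it alone.
encoded_segment = quote("a/b", safe="")  # "a%2Fb"

url_encoded_slash = "http://example.com/" + encoded_segment + "/c"
url_plain_slashes = "http://example.com/a/b/c"

def path_segments(url):
    # Split on the delimiter first, then decode each segment.
    raw_path = urlsplit(url).path
    return [unquote(s) for s in raw_path.split("/")[1:]]

segments_encoded = path_segments(url_encoded_slash)  # ['a/b', 'c']
segments_plain = path_segments(url_plain_slashes)    # ['a', 'b', 'c']

# Encoding the full URL as if it were one component also escapes
# the delimiters, so the result is no longer a usable URL.
mangled = quote(url_plain_slashes, safe="")
print(segments_encoded, segments_plain, mangled)
```

The two URLs decode to different segment lists, and the last line shows what happens when the whole URL is encoded in one pass: the scheme and path delimiters are escaped along with the data.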
If a URL contains characters which are allowed in the URL
dictionary, will we ever need to encode those characters? No.
What is the URL dictionary? Here's one of the relevant RFCs on URLs:
http://www.ietf.org/rfc/rfc3986.txt
Selected quotes:
"A percent-encoding mechanism is used to represent a data octet
_in_a_component_"
"the conflicting data must be percent-encoded
_before_the_URI_is_formed_"
Emphasis added to, well, emphasize that encoding applies to component
parts.
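
As a rough illustration of "encode before the URI is formed", the sketch below encodes each component separately and only then assembles the URL. The helper name build_url and the sample values are hypothetical, not from the RFC or from the original question; the library calls are from Python's standard urllib.parse.

```python
from urllib.parse import quote, urlencode, urlunsplit

def build_url(scheme, host, segments, params):
    """Hypothetical helper: encode each component, then form the URL."""
    # Each path segment is encoded on its own, so a "/" inside the
    # data becomes %2F while the "/" we join with stays a delimiter.
    path = "/" + "/".join(quote(s, safe="") for s in segments)
    # urlencode() escapes each key and value, so an "&" inside the
    # data cannot be mistaken for a parameter separator.
    query = urlencode(params)
    # Only now are the already-encoded parts combined into one URL.
    return urlunsplit((scheme, host, path, query, ""))

built = build_url("http", "example.com", ["a/b", "c"], {"q": "a&b"})
print(built)  # http://example.com/a%2Fb/c?q=a%26b
```

Going the other way (sanitizing a URL you received as a single string) means splitting it into components first and then applying the same per-component rules, which is the point of the quotes above.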
--
Scott Reynen
MakeDataMakeSense.com