On Mar 13, 2010, at 8:20 PM, nitin gupta wrote:

I completely agree with what you and Scott are trying to say. But I am not looking to create a URL, just to sanitize one by removing disallowed characters, i.e. what a browser does when a user types in a URL. Consider the following URL, which I parse from XML:

http://example.com?test/com

Do you think I should encode the '/' in the query part, i.e. [test/com]?

Technically, yes, but that's beside the point. Regardless of how strictly you choose to apply URL encoding, you should be applying it to specific URL parts, not full URLs.

I don't think we need to. (Nor will Firefox, if you enter this URL in the address bar).

You're right that encoding the slash character isn't particularly important in the query. In a path segment, however, the difference between encoded and unencoded slashes is very significant: http://example.com/a/b/c is different from http://example.com/a%2fb/c. And a slash definitely shouldn't be encoded where it's used as a delimiter between URL components. This is actually a good example of why encoding must be applied to individual URL components, not the full URL.
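
To make the path-segment point concrete, here is a rough sketch in Python (my choice purely for illustration; the segment values are made up). It encodes each segment separately before joining, then shows what happens if the already-assembled URL is encoded instead:

```python
from urllib.parse import quote

# Hypothetical path segments; the first one contains a slash as data,
# not as a delimiter.
segments = ["a/b", "c"]

# Encode each segment on its own (safe="" so "/" is escaped too),
# then join the encoded segments with the real "/" delimiter.
path = "/" + "/".join(quote(s, safe="") for s in segments)
url = "http://example.com" + path
print(url)     # http://example.com/a%2Fb/c

# Encoding the full URL in one pass escapes the delimiters as well,
# so the result is no longer a usable URL.
broken = quote("http://example.com/a/b/c", safe="")
print(broken)  # http%3A%2F%2Fexample.com%2Fa%2Fb%2Fc
```

The first result has two path segments ("a/b" and "c"); an unencoded http://example.com/a/b/c would have three.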

If a URL contains characters which are allowed in the URL dictionary, will we ever need to encode those characters? No.

What is the URL dictionary? Here's one of the relevant RFCs on URLs:

http://www.ietf.org/rfc/rfc3986.txt

Selected quotes:

"A percent-encoding mechanism is used to represent a data octet _in_a_component_"

"the conflicting data must be percent-encoded _before_the_URI_is_formed_"

Emphasis added to, well, emphasize that encoding applies to component parts.
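
For the "sanitize an existing URL" case, one possible approach is to split the URL into its components, encode each one under its own rules, and put it back together. The following is only a sketch under assumptions: the function name sanitize_url is hypothetical, the sets of characters left unescaped are my own picks rather than a reading of the RFC, and "%" is left alone on the assumption that any escapes already in the input are intentional.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def sanitize_url(url):
    """Percent-encode the path and query separately, leaving delimiters alone."""
    parts = urlsplit(url)
    # Assumption: "%" is kept safe so existing escapes are not double-encoded.
    path = quote(parts.path, safe="/%")
    # Assumption: "/", "?", "&" and "=" are left unescaped in the query.
    query = quote(parts.query, safe="/?&=%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(sanitize_url("http://example.com?test/com"))   # http://example.com?test/com
print(sanitize_url("http://example.com/a b?q=x y"))  # http://example.com/a%20b?q=x%20y
```

The URL from the question comes back unchanged, since nothing in it needs encoding, while the spaces in the second URL are escaped within the path and the query without touching the delimiters between components.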

--
Scott Reynen
MakeDataMakeSense.com

