Hi Scott,

I don't think we contradict each other; we only differ in point of view. I am saying that we should not encode a character that is already allowed. If the URL already appears somewhere as (http://example.com/hj/hj), there is no need to encode it, and if it appears as (http://example.com%2f/test), there is no need to decode it. If you get such a URL, just do not touch it, because it contains no invalid characters.
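
To make the "do not touch allowed characters" idea concrete, here is a rough sketch. It is in Python rather than Drupal's PHP, purely for illustration, and `sanitize_url` and `SAFE` are made-up names, not an existing API. It assumes the input is already a URL, so delimiters and existing percent-escapes are left alone and only characters outside the RFC 3986 set are encoded. It does not check that every "%" is followed by two hex digits.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so that existing escapes
# such as %2f are passed through instead of being double-encoded.
# (Unreserved characters are never encoded by quote().)
SAFE = ":/?#[]@!$&'()*+,;=%"

def sanitize_url(url):
    """Percent-encode only the characters that are not allowed in a URL."""
    return quote(url, safe=SAFE)

print(sanitize_url("http://example.com/hj/hj"))    # already valid
print(sanitize_url("http://example.com%2f/test"))  # existing escape kept
print(sanitize_url("http://example.com/a b"))      # space is not allowed
```

The first two URLs should come back unchanged; only the space in the third one is encoded.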
@URL dictionary: Are you kidding?? I was obviously referring to the same RFC. I would like you to think for a moment and tell me what you would gain by breaking the URL into components, encoding them, and then joining them again. Consider this problem statement: you are given a URL, extracted from the HTML source of a webpage, and you need to access it using drupal_http_request().

I am, of course, interested in improving what I currently have in hand.

"Fire me all you can, but cast me into a solid and beautiful pot"

--
Regards,
Nitin Kumar Gupta
http://publicmind.in/blog/

On Sun, Mar 14, 2010 at 10:50 AM, Scott Reynen <[email protected]> wrote:

> On Mar 13, 2010, at 8:20 PM, nitin gupta wrote:
>
>> I completely agree to what you and Scott are trying to say. But, I am not
>> looking to create an URL, just to sanitize it to remove disallowed
>> character, i.e. what a browser would do while accessing a URL when a user
>> inputs an URL. Consider, I parse the following URL from XML:
>>
>> http://example.com?test/com
>>
>> Do you think I should encode the '/' in the query part i.e. [test/com]??
>
> Technically, yes, but that's beside the point. Regardless of how strictly
> you choose to apply URL encoding, you should be applying it to specific URL
> parts, not full URLs.
>
>> I don't think we need to. (Nor will Firefox, if you enter this URL in the
>> address bar).
>
> You're right that encoding the slash character isn't particularly important
> in the query. In a path segment, however, the difference between encoded
> and unencoded slashes is very significant; http://example.com/a/b/c is
> different than http://example.com/a%2fb/c. And a slash definitely
> shouldn't be encoded where it's used as a delimiter between URL components.
> This is actually a good example of why encoding must be applied to
> individual URL components, not the full URL.
>
>> If a URL contains characters which are allowed in the URL dictionary, will
>> we ever need to encode those characters? No.
>
> What is the URL dictionary? Here's one of the relevant RFC on URLs:
>
> http://www.ietf.org/rfc/rfc3986.txt
>
> Selected quotes:
>
> "A percent-encoding mechanism is used to represent a data octet
> _in_a_component_"
> "the conflicting data must be percent-encoded _before_the_URI_is_formed_"
>
> Emphasis added to, well, emphasize that encoding applies to component
> parts.
>
> --
> Scott Reynen
> MakeDataMakeSense.com
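
For comparison, the per-component encoding described in the quoted reply could be sketched as follows. This is again Python for illustration only, not Drupal code; `build_url` is a hypothetical helper, and the http scheme is hard-coded as a simplifying assumption. Each path segment is encoded on its own before the URL is formed, so a slash inside a segment becomes %2F while the slashes between segments stay as delimiters.

```python
from urllib.parse import quote, urlencode

def build_url(host, segments, query=None):
    """Form a URL from raw parts, encoding each part before joining."""
    # safe="" means a "/" inside a segment is data and gets encoded.
    path = "/".join(quote(segment, safe="") for segment in segments)
    url = "http://" + host + "/" + path
    if query:
        # urlencode() percent-encodes each key and value separately.
        url += "?" + urlencode(query)
    return url

print(build_url("example.com", ["a", "b", "c"]))  # three path segments
print(build_url("example.com", ["a/b", "c"]))     # two path segments
```

The two calls give different URLs (/a/b/c versus /a%2Fb/c), which is the distinction that is lost when encoding is applied to the full URL after it has been joined.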
