Re: [Python-Dev] bug in urlparse

Duncan Booth Tue, 06 Sep 2005 04:51:27 -0700

[EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]:


> According to RFC 2396[1] section 5.2:
> 
>       g) If the resulting buffer string still begins with one or more
>          complete path segments of "..", then the reference is
>          considered to be in error.  Implementations may handle this
>          error by retaining these components in the resolved path (i.e.,
>          treating them as part of the final URI), by removing them from
>          the resolved path (i.e., discarding relative levels above the
>          root), or by avoiding traversal of the reference.
> 
> If I read this right, it explicitly allows the urlparse.urljoin behavior
> ("handle this error by retaining these components in the resolved path").
> 

Yes, the urljoin behaviour is explicitly allowed; however, it is not the 
most commonly implemented of the permitted behaviours. Both IE and 
Mozilla/Firefox handle this error by stripping the spurious .. elements 
from the front of the path. Apache (and, I hope, other web servers) takes 
the third permitted approach, i.e. rejecting requests for these invalid URLs.
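
To make the three permitted behaviours concrete, here is a rough sketch of
the path-merging step with the handling of excess ".." segments made
selectable. The function name resolve_path and its excess parameter are
invented for this illustration; this is not how urlparse is implemented, it
works on absolute base paths only, and it deliberately does not call urljoin,
whose output may differ between Python versions.

```python
def resolve_path(base_path, ref_path, excess="retain"):
    """Merge a relative path onto an absolute base path.

    Hypothetical helper, loosely following RFC 2396 section 5.2 step 6.
    'excess' selects what happens to ".." segments that would climb
    above the root: "retain", "strip" or "reject".
    """
    if excess not in ("retain", "strip", "reject"):
        raise ValueError("unknown policy: %r" % (excess,))
    if not base_path.startswith("/"):
        raise ValueError("base_path must be absolute")

    # Drop the leading root marker and the last segment of the base,
    # then append the segments of the reference.
    segments = base_path.split("/")[1:-1] + ref_path.split("/")

    out = []
    for seg in segments:
        if seg == ".":
            continue                      # "." never changes the path
        if seg != "..":
            out.append(seg)               # ordinary segment
        elif out and out[-1] != "..":
            out.pop()                     # ".." cancels the previous segment
        elif excess == "retain":
            out.append("..")              # keep it in the resolved path
        elif excess == "reject":
            raise ValueError("reference climbs above the root: %r" % (ref_path,))
        # excess == "strip": silently discard the segment
    return "/" + "/".join(out)


print(resolve_path("/b/c/d", "../../../g", "retain"))  # the behaviour described for urljoin
print(resolve_path("/b/c/d", "../../../g", "strip"))   # the behaviour described for browsers
```

With the "reject" policy the same call raises ValueError, which corresponds
to the server-side behaviour of refusing the request.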

The net effect is that, on some sites, a Python spider (e.g. 
webchecker.py) will produce a large number of error messages for links 
which browsers actually resolve successfully. (At least, that is how I 
first noticed this particular problem.) Depending on your reasons for 
spidering a site, this can be either a good thing or an annoyance.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
