[
https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968882#comment-13968882
]
Sebastian Nagel commented on NUTCH-1748:
----------------------------------------
Hi [~Sertac Turkel], thanks, +1 for the unit tests.
I'm not sure about the original intention of urlfilter-validator (and its
source [commons'
UrlValidator|http://commons.apache.org/proper/commons-validator/javadocs/api-1.4.0/org/apache/commons/validator/routines/UrlValidator.html]):
it's not the exclusion of URLs containing dot elements in the path (sorry,
I was wrong about that). In any case, counting ".." and slashes in the path
and comparing their numbers is rather naive and does not check anything in a
systematic way:
{code}
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/../..")); // fails
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/./././../..")); // succeeds!
{code}
Maybe the intention was to exclude paths which go "beyond" the server root
because they contain too many ".." elements. But the behaviour in that case is
explicitly defined in
[RFC3986 remove_dot_segments|http://tools.ietf.org/html/rfc3986#section-5.2.4]
and modern browsers resolve (normalize) such URLs correctly.
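To illustrate what RFC 3986 prescribes here, the following is a minimal sketch of the remove_dot_segments algorithm from section 5.2.4. The class name {{DotSegments}} and its methods are hypothetical and not part of Nutch or commons-validator; the sketch operates on the path component only.

```java
// Hypothetical helper, not part of Nutch or commons-validator.
final class DotSegments {
    private DotSegments() {}

    // Sketch of RFC 3986, section 5.2.4 (remove_dot_segments), applied to a path.
    static String removeDotSegments(String path) {
        String in = path;                       // input buffer
        StringBuilder out = new StringBuilder(); // output buffer
        while (!in.isEmpty()) {
            if (in.startsWith("../")) {          // step A
                in = in.substring(3);
            } else if (in.startsWith("./")) {    // step A
                in = in.substring(2);
            } else if (in.startsWith("/./")) {   // step B
                in = in.substring(2);
            } else if (in.equals("/.")) {        // step B
                in = "/";
            } else if (in.startsWith("/../")) {  // step C
                in = in.substring(3);
                removeLastSegment(out);
            } else if (in.equals("/..")) {       // step C
                in = "/";
                removeLastSegment(out);
            } else if (in.equals(".") || in.equals("..")) { // step D
                in = "";
            } else {                             // step E: move one segment to the output
                int next = in.indexOf('/', 1);
                if (next < 0) {
                    next = in.length();
                }
                out.append(in, 0, next);
                in = in.substring(next);
            }
        }
        return out.toString();
    }

    // Drops the last segment and its preceding "/" from the output buffer, if any.
    private static void removeLastSegment(StringBuilder out) {
        int lastSlash = out.lastIndexOf("/");
        out.setLength(lastSlash < 0 ? 0 : lastSlash);
    }
}
```

With this sketch, both paths from the assertions above ("/a/../.." and "/a/./././../..") normalize to "/", i.e. surplus ".." elements never climb above the root.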
In general, it would make sense to reject any URL containing dot elements or
empty elements in the path: "The complete path segments '.' and '..' are
intended only for use within relative references"
([RFC3986|http://tools.ietf.org/html/rfc3986#section-6.2.2.3]). However, this
would require some more work.
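A rough sketch of such a check, only to make the idea concrete: the class {{PathSegmentCheck}} and its method are hypothetical, and the sketch assumes that a single trailing slash ("/a/") is acceptable and that only the path component is inspected. It is not the existing urlfilter-validator logic.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch or commons-validator.
final class PathSegmentCheck {
    private PathSegmentCheck() {}

    // Returns true if the URL path contains a complete "." or ".." segment,
    // or an empty segment such as the one in "/a//b".
    static boolean hasDotOrEmptySegment(String url) {
        String path = URI.create(url).getRawPath();
        if (path == null) {
            return false; // opaque URI without a hierarchical path
        }
        // Limit -1 keeps trailing empty strings, so "/a/" yields ["", "a", ""].
        String[] segments = path.split("/", -1);
        // Index 0 is the empty string before the leading slash and is skipped.
        for (int i = 1; i < segments.length; i++) {
            String segment = segments[i];
            if (segment.equals(".") || segment.equals("..")) {
                return true;
            }
            // Assumption: an empty last segment (trailing slash) is allowed.
            if (segment.isEmpty() && i < segments.length - 1) {
                return true;
            }
        }
        return false;
    }
}
```

Because only complete segments are compared, a file name such as "abc..xyz.txt" passes, while "/a/../..", "/a/./b" and "/a//b" are flagged.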
Comments about the desired behaviour are welcome!
> urlfilter-validator to allow .. (two dots) inside file names (path elements)
> ----------------------------------------------------------------------------
>
> Key: NUTCH-1748
> URL: https://issues.apache.org/jira/browse/NUTCH-1748
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.2.1
> Reporter: Sertac TURKEL
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1748.patch
>
>
> Unix systems accept file names containing two dots, e.g. "abc..xyz.txt". So
> urlfilter-validator should not reject this kind of URL. Paths
> containing "/../", or "/.." in final position, should still be rejected.