[
https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doğacan Güney reassigned NUTCH-546:
-----------------------------------
Assignee: Doğacan Güney
> file URL are filtered out by the crawler
> ----------------------------------------
>
> Key: NUTCH-546
> URL: https://issues.apache.org/jira/browse/NUTCH-546
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: Windows XP
> Nutch trunk from Monday, August 20th 2007
> Reporter: Marc Brette
> Assignee: Doğacan Güney
> Attachments: NUTCH-546.patch
>
>
> I tried to index a file system using the file:/ protocol, which worked fine in
> version 0.9.
> The file URLs are being filtered out and not fetched at all.
> I investigated the code and found two issues:
> 1) One is with the class UrlValidator: when validating a URL, it checks the
> 'authority', a combination of host and port. As it is null for file URLs, the
> URL is rejected.
> 2) Once this check is removed, files whose names contain space characters (and
> maybe other characters that must be URL-encoded) are also filtered out. This
> may be because the file protocol plugin doesn't URL-encode space characters
> and/or UrlValidator enforces the rule that such characters be encoded.
> To work around these issues, I just commented out the UrlValidator checks and
> it works fine.
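
The two behaviours described above can be reproduced with the JDK alone. The
sketch below is only an illustration and is not the code of Nutch's
UrlValidator or of the attached patch: the class name FileUrlCheck and its
three helper methods are hypothetical, and the paths are made-up examples. It
shows that java.net.URI reports a null authority for a file:/ URL, that a raw
space makes the URL unparseable, and that building the URI from its components
percent-encodes the space.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration; not part of Nutch.
public class FileUrlCheck {

    // Mimics an "authority must be present" rule (assumed, not Nutch's code).
    static boolean hasAuthority(String url) {
        return URI.create(url).getAuthority() != null;
    }

    // True if java.net.URI accepts the string as-is; a raw space is rejected.
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a file URI from an absolute path; this multi-argument
    // constructor percent-encodes characters that are illegal in a path.
    static String encodeFilePath(String absolutePath) {
        try {
            return new URI("file", null, absolutePath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasAuthority("http://example.com:8080/a.txt"));
        System.out.println(hasAuthority("file:/tmp/a.txt"));
        System.out.println(parses("file:/tmp/my file.txt"));
        System.out.println(encodeFilePath("/tmp/my file.txt"));
    }
}
```

A validator that requires a non-null authority would therefore have to treat
the file scheme as a special case, and file paths would need to be encoded
before any strict syntax check is applied to them.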