[ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526667 ]
Hudson commented on NUTCH-546:
------------------------------

Integrated in Nutch-Nightly #204 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/204/])

> file URL are filtered out by the crawler
> ----------------------------------------
>
>                 Key: NUTCH-546
>                 URL: https://issues.apache.org/jira/browse/NUTCH-546
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Windows XP
>                      Nutch trunk from Monday, August 20th 2007
>            Reporter: Marc Brette
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-546-validator-plugin_v1.patch, NUTCH-546.patch
>
>
> I tried to index a file system using the file:/ protocol, which worked fine in
> version 0.9.
> The file URLs are being filtered out and not fetched at all.
> I investigated the code and saw that there are 2 issues:
> 1) One is with the class UrlValidator: when validating a URL, it checks the
> 'authority', a combination of host and port. As it is null for file URLs, the
> URL is rejected.
> 2) Once this check is removed, files whose names contain space characters (and
> maybe other characters that must be URL encoded) are also filtered out. It may
> be because the file protocol plugin doesn't URL encode space characters and/or
> because UrlValidator enforces the rule that such characters be encoded.
> To work around these issues, I just commented out the UrlValidator checks, and
> it works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
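
The two symptoms described in the report can be reproduced outside Nutch with the JDK's own java.net.URI class. The sketch below is only an illustration under that assumption: it does not use or reproduce Nutch's UrlValidator, and the class name FileUrlCheck, its method names, and the sample paths are hypothetical. It shows that a file: URL of this form has no authority, that a raw space makes a URL fail strict parsing, and one way a path could be percent-encoded before validation.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: it only illustrates the two
// behaviours described in the report using java.net.URI.
public class FileUrlCheck {

    /** Returns the authority (host[:port]) of a URL string, or null if it has none. */
    public static String authorityOf(String url) {
        return URI.create(url).getAuthority();
    }

    /** Returns true if the string parses as a URI without any escaping applied first. */
    public static boolean parsesStrictly(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Builds a file: URI from a raw absolute path, percent-encoding illegal characters such as spaces. */
    public static String encodeFilePath(String path) {
        try {
            // The multi-argument constructor quotes illegal characters in the path.
            return new URI("file", null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // An http URL carries an authority; a file:/ URL of this form does not.
        System.out.println(authorityOf("http://example.org:8080/index.html"));
        System.out.println(authorityOf("file:/C:/docs/report.txt"));

        // A raw space is not a legal URI character, so strict parsing fails.
        System.out.println(parsesStrictly("file:/C:/my docs/report.txt"));

        // Encoding the path first yields a URI that parses cleanly.
        System.out.println(encodeFilePath("/C:/my docs/report.txt"));
    }
}
```

A validator that treats a null authority as invalid would therefore reject every file:/ URL of this form, regardless of whether the path itself is well formed.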