file URL are filtered out by the crawler
----------------------------------------

                 Key: NUTCH-546
                 URL: https://issues.apache.org/jira/browse/NUTCH-546
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Windows XP
Nutch trunk from Monday, August 20th 2007
            Reporter: Marc Brette


I tried to index a file system using the file:/ protocol, which worked fine in 
version 0.9.
With the current trunk, file URLs are filtered out and not fetched at all.

I investigated the code and found two issues:
1) The first is in the class UrlValidator: when validating a URL, it checks the 
'authority', a combination of host and port. As the authority is null for file 
URLs, the URL is rejected.
2) Once this check is removed, files whose names contain space characters (and 
maybe other characters that need to be URL-encoded) are also filtered out. This 
may be because the file protocol plugin doesn't URL-encode space characters 
and/or UrlValidator enforces the rule that such characters must be encoded.
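
Both symptoms can be reproduced outside Nutch with the standard java.net.URI 
class. The sketch below is only an illustration and does not use Nutch's 
UrlValidator. The class name FileUrlCheck, its helper methods and the sample 
paths are hypothetical. It shows that a file URL has no authority component, 
and that an unencoded space makes the URL syntactically invalid until it is 
percent-encoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; it is not part of Nutch.
public class FileUrlCheck {

    // Returns the authority (host[:port]) component, or null when there is none.
    static String authorityOf(String url) throws URISyntaxException {
        return new URI(url).getAuthority();
    }

    // Returns true when java.net.URI accepts the string as syntactically valid.
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a file URL from a raw path, percent-encoding illegal characters
    // such as spaces (the multi-argument URI constructors quote them).
    static String encodePath(String path) throws URISyntaxException {
        return new URI("file", null, path, null).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // Sample URLs and paths are assumptions made for this illustration.
        System.out.println(authorityOf("http://example.com:8080/a.html")); // example.com:8080
        System.out.println(authorityOf("file:/C:/docs/report.txt"));       // null
        System.out.println(parses("file:/C:/docs/my report.txt"));         // false
        System.out.println(encodePath("/C:/docs/my report.txt"));          // file:/C:/docs/my%20report.txt
    }
}
```

A validator that requires a non-null authority will therefore reject every 
file:/ URL, and one that requires strictly valid URL syntax will reject paths 
with raw spaces unless the protocol plugin encodes them first.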

To work around these issues, I just turned off all UrlValidator checks, and it 
works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
