[ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522852 ]
Doğacan Güney commented on NUTCH-546:
-------------------------------------
This is true; I missed it when committing UrlValidator. I guess we can change
UrlValidator to run the full validation (the authority check and so on) only on
URLs with the schemes http, https and ftp (are there any others?) and to
automatically accept anything with a different scheme. This should be done
before the ASCII pattern check, because file names can contain non-ASCII
characters, so that check has to be skipped for such URLs too.
So, it can go like this:
1) Make sure that the URL has ":/" in it. This will help eliminate noise
(I think all URLs must have a scheme part).
2) Check whether the URL starts with http, https or ftp. If it starts with none
of them, return true (to indicate that the URL is valid).
3) If the URL starts with one of them, run the validation code.
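The three steps above could be sketched roughly as follows. This is only an
illustration, not the actual UrlValidator code: the class name
SchemeAwareUrlCheck and the method fullValidation are hypothetical, and
fullValidation merely stands in for the existing checks by requiring a
non-null host via java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch of the proposed flow; not the real Nutch UrlValidator.
class SchemeAwareUrlCheck {

    // Schemes that still get the full validation, per the proposal.
    private static final String[] STRICT_SCHEMES = {"http", "https", "ftp"};

    static boolean isValid(String url) {
        if (url == null) {
            return false;
        }

        // Step 1: require ":/" preceded by at least one character,
        // so strings without a scheme part are rejected as noise.
        int sep = url.indexOf(":/");
        if (sep <= 0) {
            return false;
        }

        // Step 2: anything that is not http, https or ftp is accepted as-is.
        String scheme = url.substring(0, sep).toLowerCase(Locale.ROOT);
        boolean strict = false;
        for (String s : STRICT_SCHEMES) {
            if (s.equals(scheme)) {
                strict = true;
                break;
            }
        }
        if (!strict) {
            return true;
        }

        // Step 3: run the full validation for http, https and ftp.
        return fullValidation(url);
    }

    // Assumption: a stand-in for the existing validation code. It only
    // checks that the URL parses and has a host (a crude authority check).
    static boolean fullValidation(String url) {
        try {
            return new URI(url).getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

With this ordering, a URL such as file:/C:/my docs/a.txt is accepted at step 2
and never reaches the authority or character checks, while an http URL without
a host is still rejected at step 3.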
(It would be nice if we had some way of running some sort of validation on
file:/ URLs and other protocols, but I don't know if there are rules for such
protocols.)
Does this sound good? I will send a patch for it soon. (Or, are you already
working on this, Marc?)
> file URL are filtered out by the crawler
> ----------------------------------------
>
> Key: NUTCH-546
> URL: https://issues.apache.org/jira/browse/NUTCH-546
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: Windows XP
> Nutch trunk from Monday, August 20th 2007
> Reporter: Marc Brette
>
> I tried to index a file system using the file:/ protocol, which worked fine in
> version 0.9.
> The file URLs are being filtered out and not fetched at all.
> I investigated the code and saw that there are 2 issues:
> 1) One is with the class UrlValidator: when validating a URL, it checks the
> 'authority', a combination of host and port. As it is null for file URLs, the
> URL is rejected.
> 2) Once this check is removed, files that contain space characters (and maybe
> other characters that need to be URL encoded) are also filtered out. It may be
> because the file protocol plugin doesn't URL encode space characters and/or
> UrlValidator enforces the rule that such characters be encoded.
> To work around these issues, I just commented out the UrlValidator checks and
> it works fine.