[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

JIRA Mon, 27 Aug 2007 00:53:51 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522944
 ]


Doğacan Güney commented on NUTCH-546:
-------------------------------------

> I don't know the design decision behind UrlValidator, but why didn't you just 
> instantiate the Java URL class? 

UrlValidator is taken from Apache's commons-validator package and ported 
to Nutch. We use UrlValidator because Java's URL class is not really 
sufficient for our needs. For example, Java's URL class does not throw a 
MalformedURLException for a URL like "http://www.example.com/a<div" (Nutch's 
parse-js plugin scans JavaScript sources to extract URLs and sometimes 
extracts URLs such as this one). Another example is URLs with spaces in them. 
Currently, Nutch can't fetch (and I believe that it shouldn't fetch) a URL 
that contains a space, so URL validation filters it out. However, note that 
URL validation runs _after_ URL normalization. URL normalization is a facility 
for working around various quirks in URLs. So one can write a URL normalizer 
that percent-encodes a space (as %20), producing a URL that Nutch will fetch. 
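
To illustrate the two points above, here is a minimal sketch in plain Java. 
It is not Nutch code: the class name UrlCheckSketch and both helper methods 
are made up for this example, and encodeSpaces only stands in for what a real 
URL normalizer plugin might do with spaces.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration; none of these names exist in Nutch.
public class UrlCheckSketch {

    // Returns true if java.net.URL accepts the string without throwing.
    // (The URL(String) constructor is deprecated since Java 20 but still available.)
    static boolean parsesAsJavaUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Assumed normalization step: percent-encode literal spaces only.
    static String encodeSpaces(String url) {
        return url.replace(" ", "%20");
    }

    public static void main(String[] args) {
        // java.net.URL does not reject the stray '<' or the space.
        System.out.println(parsesAsJavaUrl("http://www.example.com/a<div"));
        System.out.println(parsesAsJavaUrl("http://www.example.com/my file.html"));
        // It does reject a string with no protocol at all.
        System.out.println(parsesAsJavaUrl("www.example.com/index.html"));
        // After normalization the space is gone, so validation can pass.
        System.out.println(encodeSpaces("http://www.example.com/my file.html"));
    }
}
```

Because normalization runs before validation, a URL that has been through a 
step like encodeSpaces no longer contains the space that validation would 
otherwise reject.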

You may think of URL validation as a filter that eliminates invalid URLs plus 
anything Nutch can't fetch. The mistake was that, while porting UrlValidator, 
we only considered the protocol-http and protocol-httpclient plugins when 
deciding what Nutch can and can't fetch.
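
The authority problem described in the issue can be reproduced with 
java.net.URI alone. The sketch below is only an illustration of the idea, not 
the UrlValidator code and not a proposed patch: the class AuthoritySketch and 
the rule in passesAuthorityCheck (require an authority for every scheme except 
"file") are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; not the UrlValidator implementation.
public class AuthoritySketch {

    // True if the string parses as a URI and has an authority (host[:port]).
    static boolean hasAuthority(String s) {
        try {
            return new URI(s).getAuthority() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Assumed rule: "file" URIs may omit the authority, all others need one.
    // Strings that fail to parse (e.g. ones with a raw space) are rejected.
    static boolean passesAuthorityCheck(String s) {
        try {
            URI uri = new URI(s);
            if ("file".equalsIgnoreCase(uri.getScheme())) {
                return true;
            }
            return uri.getAuthority() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasAuthority("http://www.example.com:8080/x"));
        System.out.println(hasAuthority("file:/C:/docs/readme.txt"));
        System.out.println(passesAuthorityCheck("file:/C:/docs/readme.txt"));
        System.out.println(passesAuthorityCheck("file:/C:/docs/my file.txt"));
    }
}
```

A check that insists on a non-null authority for every URL rejects all 
file:/ URLs, while a scheme-aware check lets them through and still rejects 
a file URL with an unencoded space.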

I hope this explanation helps. Feel free to add comments if you have more 
questions or something doesn't make sense.

> file URL are filtered out by the crawler
> ----------------------------------------
>
>                 Key: NUTCH-546
>                 URL: https://issues.apache.org/jira/browse/NUTCH-546
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Windows XP
> Nutch trunk from Monday, August 20th 2007
>            Reporter: Marc Brette
>
> I tried to index a file system using the file:/ protocol, which worked fine 
> in version 0.9.
> File URLs are now filtered out and not fetched at all.
> I investigated the code and found two issues:
> 1) The first is in the class UrlValidator: when validating a URL, it checks 
> the 'authority', a combination of host and port. As it is null for file 
> URLs, the URL is rejected.
> 2) Once this check is removed, files whose names contain space characters 
> (and maybe other characters that need URL encoding) are also filtered out. 
> This may be because the file protocol plugin doesn't URL-encode space 
> characters and/or because UrlValidator enforces encoding of such characters.
> To work around these issues, I just commented out the UrlValidator checks, 
> and it works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
