[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514144
 ] 

Doğacan Güney commented on NUTCH-522:
-------------------------------------

> Oops, my mistake. Please find an updated patch. 

This patch looks good.

> For instance: http://lucene.apache.org/jira/browse.jsp?itemid=500 &sort=up
> A space between 500 and & has been accepted.
> Is that normal?
>
> I really want to exclude those kinds of URLs.

UrlValidator is meant to eliminate anything Nutch can't fetch. So, if the 
fetcher fails while trying to fetch that URL, then UrlValidator should have 
eliminated it, and letting it through is a bug.
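
To illustrate the kind of check being discussed, here is a minimal sketch. It
is not Nutch's UrlValidator; it assumes java.net.URI as a stand-in syntax
check, and the class and method names (UrlSyntaxCheckSketch,
isSyntacticallyValid) are hypothetical. java.net.URI rejects an unescaped
space, so the URL quoted above would be dropped under this assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for a URL validity check; NOT Nutch's UrlValidator.
final class UrlSyntaxCheckSketch {

    // Returns true only if the string parses as an absolute URI with a host.
    static boolean isSyntacticallyValid(String url) {
        try {
            URI uri = new URI(url); // throws on illegal characters such as a raw space
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String withSpace = "http://lucene.apache.org/jira/browse.jsp?itemid=500 &sort=up";
        String withoutSpace = "http://lucene.apache.org/jira/browse.jsp?itemid=500&sort=up";
        System.out.println(isSyntacticallyValid(withSpace));
        System.out.println(isSyntacticallyValid(withoutSpace));
    }
}
```

Whether a real validator should reject such a URL depends on whether the
fetcher can actually retrieve it, which this sketch does not test.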

[...snip...]
> It includes an option to disallow FRAGMENTS. Why don't we have this version 
> in nutch ?

Because urlfilters can already do that, so I didn't want to duplicate 
functionality. UrlValidator eliminates invalid URLs, then urlnormalizers and 
urlfilters decide what to do with the remaining ones. You can remove fragments 
or skip URLs with fragments.
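
The division of labour described above can be sketched roughly as follows.
This is an illustration only, not Nutch's API: the class UrlPipelineSketch,
its methods, and the skipFragments flag are all hypothetical, and validation
is again approximated with java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical pipeline: validate first, then either filter or normalize.
final class UrlPipelineSketch {

    // Step 1 (validator role): drop anything that is not a well-formed absolute URL.
    static boolean validate(String url) {
        try {
            URI uri = new URI(url);
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Normalizer role: remove the fragment, if any.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Filter role: decide whether a URL carrying a fragment is skipped entirely.
    static boolean hasFragment(String url) {
        return url.indexOf('#') >= 0;
    }

    // Returns the URL to keep, or empty if it was eliminated at any step.
    static Optional<String> process(String url, boolean skipFragments) {
        if (!validate(url)) {
            return Optional.empty();          // invalid: eliminated up front
        }
        if (skipFragments && hasFragment(url)) {
            return Optional.empty();          // policy: skip URLs with fragments
        }
        return Optional.of(stripFragment(url)); // policy: keep, minus the fragment
    }

    public static void main(String[] args) {
        System.out.println(process("http://example.com/page#section", false));
        System.out.println(process("http://example.com/page#section", true));
        System.out.println(process("http://example.com/a b", false));
    }
}
```

The point of the split is that the fragment policy lives in the normalizer or
filter step and can change without touching validation.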

> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check URLs in the Injector

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
