[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517232
 ] 

Doğacan Güney commented on NUTCH-522:
-------------------------------------

> I tried with protocol-http and protocol-httpclient; I got the same error when 
> the url contained a space.
> I'm afraid it didn't change anything. 

Actually, this is good news :). It means we can update the url pattern to 
exclude urls that contain spaces.
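
To illustrate the idea, here is a minimal sketch of a pattern that rejects 
any url containing whitespace. The class name UrlSpaceCheck and the regex 
are assumptions made up for this example; this is not the pattern the 
validator in Nutch actually uses.

```java
import java.util.regex.Pattern;

public class UrlSpaceCheck {
    // Hypothetical pattern: a scheme, "://", then one or more
    // non-whitespace characters up to the end of the string.
    private static final Pattern NO_WHITESPACE_URL =
        Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://\\S+$");

    // Returns true only if the whole string matches, so a space
    // anywhere in the url makes it invalid.
    public static boolean isValid(String url) {
        return url != null && NO_WHITESPACE_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://example.com/a b"));   // false
        System.out.println(isValid("http://example.com/a%20b")); // true
    }
}
```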

> I think you're right about the order, the normalizer should come first.

Btw, this is already what we do in ParseOutputFormat. Urls are normalized in 
Outlink's constructor, then validated and filtered in ParseOutputFormat. 

So, I am going to reverse the validator/normalizer order in your patch (so 
that the normalizer runs first) and commit it soon.
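
A rough sketch of why the order matters, under stated assumptions: the 
normalize() and isValid() helpers below are hypothetical stand-ins written 
for this example (they are not the Nutch normalizer or validator), and the 
normalizer is assumed to trim the url and percent-encode spaces. With that 
assumption, normalizing first lets a repairable url through, while 
validating first drops it.

```java
import java.util.Optional;
import java.util.regex.Pattern;

public class NormalizeThenValidate {
    // Hypothetical validity pattern: scheme, "://", no whitespace.
    private static final Pattern VALID =
        Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://\\S+$");

    // Hypothetical normalizer: trims the url and encodes spaces.
    public static String normalize(String url) {
        return url.trim().replace(" ", "%20");
    }

    public static boolean isValid(String url) {
        return VALID.matcher(url).matches();
    }

    // Normalizer first, validator second: the validator sees the
    // repaired url.
    public static Optional<String> normalizeThenValidate(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String normalized = normalize(raw);
        return isValid(normalized) ? Optional.of(normalized) : Optional.empty();
    }

    // Validator first: the raw url is rejected before the normalizer
    // gets a chance to repair it.
    public static Optional<String> validateThenNormalize(String raw) {
        if (raw == null || !isValid(raw)) {
            return Optional.empty();
        }
        return Optional.of(normalize(raw));
    }

    public static void main(String[] args) {
        String raw = " http://example.com/a b ";
        System.out.println(normalizeThenValidate(raw).isPresent());
        System.out.println(validateThenNormalize(raw).isPresent());
    }
}
```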

> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check url in the Injector


