[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmanuel Joke updated NUTCH-522:
--------------------------------

    Attachment: NUTCH-522_v3.patch

commons-validator's UrlValidator does not reject URLs that contain a space; it
treats them as VALID.

I've therefore updated the UrlValidator to exclude the ASCII space character, as
you suggested. (Another way to do this would be to modify the URL_PATTERN.)
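
For illustration only, here is a minimal sketch of the idea of rejecting any URL that contains the ASCII space before the usual validation runs. It is not the code in the attached patch: the class name SpaceRejectingUrlCheck, its methods, and the permissive delegate used in main are all hypothetical stand-ins, and the real UrlValidator is not called here.

```java
import java.util.function.Predicate;

public class SpaceRejectingUrlCheck {

    /** Returns true if the string contains the ASCII space character (0x20). */
    public static boolean containsSpace(String url) {
        return url.indexOf(' ') >= 0;
    }

    /**
     * Rejects null and any URL containing a space; otherwise defers to the
     * wrapped validator (an assumption standing in for the real validator).
     */
    public static boolean isValid(String url, Predicate<String> delegate) {
        if (url == null || containsSpace(url)) {
            return false;
        }
        return delegate.test(url);
    }

    public static void main(String[] args) {
        // Hypothetical permissive delegate: accepts anything starting with
        // "http://", so spaces would slip through without the extra check.
        Predicate<String> permissive = u -> u.startsWith("http://");

        String[] samples = {
            "http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&expanded=fuel",
            "http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&; expanded=fuel",
            "http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '?"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s, permissive));
        }
    }
}
```

Doing the space check before delegating leaves the wrapped validator's own rules untouched, which is the alternative to changing the URL_PATTERN itself.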

I've tested with a few links and the results look correct to me:
http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&expanded=fuel is valid
http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&; expanded=fuel is not valid
http:/ is not valid
http://www.variety.com/</div> is not valid
http://www.variety.com/</div></a> is not valid
mailto:[EMAIL PROTECTED] is not valid
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid


> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check URLs in the Injector

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

