[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
--------------------------------
Attachment: NUTCH-522_v3.patch
commons-validator's UrlValidator does not reject URLs that contain a space; it
treats them as valid.
I have therefore updated the UrlValidator to exclude the ASCII space character,
as you suggested. (Another way to do this would be to modify URL_PATTERN.)
I've tested with a few links and the results look correct to me:
http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&expanded=fuel is valid
http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle& expanded=fuel is not valid
http:/ is not valid
http://www.variety.com/</div> is not valid
http://www.variety.com/</div></a> is not valid
mailto:[EMAIL PROTECTED] is not valid
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid
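
For readers without the patch at hand, here is a minimal sketch of the idea: reject any URL containing an ASCII space before further validation. This is not the code from NUTCH-522_v3.patch. The class name SpaceAwareUrlCheck and the methods containsSpace and isValid are hypothetical, and java.net.URI is used only as a stand-in for commons-validator's UrlValidator so that the example has no external dependencies. Its verdicts may differ from the real validator on other inputs.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SpaceAwareUrlCheck {

    // Hypothetical helper: true if the string contains an ASCII space (0x20).
    static boolean containsSpace(String url) {
        return url.indexOf(' ') >= 0;
    }

    // Stand-in validation: reject spaces first, then require a parseable
    // http/https URI with a host. The real patch modifies UrlValidator instead.
    static boolean isValid(String url) {
        if (url == null || containsSpace(url)) {
            return false;
        }
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equals(scheme) || "https".equals(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters or a malformed structure
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&expanded=fuel",
            "http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle& expanded=fuel",
            "http:/"
        };
        for (String s : samples) {
            System.out.println(s + (isValid(s) ? " is valid" : " is not valid"));
        }
    }
}
```

Checking for the space before parsing keeps the rule explicit and independent of how the underlying pattern or parser treats whitespace.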
> Use URLValidator in the Injector
> --------------------------------
>
> Key: NUTCH-522
> URL: https://issues.apache.org/jira/browse/NUTCH-522
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Reporter: Emmanuel Joke
> Assignee: Emmanuel Joke
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505: we should use the UrlValidator to check URLs in the Injector.