[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emmanuel Joke updated NUTCH-522:
--------------------------------

    Attachment: NUTCH-522_v3.patch

commons-validator's UrlValidator does not filter out URLs that contain a space; it treats them as VALID. I have therefore updated the UrlValidator to exclude the ASCII space character, as you suggested. (Another way to do this would be to modify URL_PATTERN.)

I've tested it with a few links and the results look correct to me:

http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle&expanded=fuel is valid
http://autos.yahoo.com/carfinder/?bodystyle=CPE&fuel=Gas&expanded=bodystyle& expanded=fuel is not valid
http:/ is not valid
http://www.variety.com/</div> is not valid
http://www.variety.com/</div></a> is not valid
mailto:[EMAIL PROTECTED] is not valid
http://ad.doubleclick.net/jump/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid
http://ad.doubleclick.net/ad/variety.dart/;sz=993x47;ord=' + randomnumber + '? is not valid

> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check URLs in the Injector.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
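For illustration only, the space check described in the comment above could be sketched roughly as follows. This is not the code in NUTCH-522_v3.patch and it does not use the commons-validator API: the class name UrlSpaceCheck, its methods, and the list of accepted schemes are all hypothetical, and java.net.URI stands in for the real validation rules.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; not the Nutch or commons-validator implementation. */
final class UrlSpaceCheck {

    // Schemes accepted by this sketch (an assumption, not taken from the patch).
    private static final Set<String> SCHEMES = Set.of("http", "https", "ftp");

    private UrlSpaceCheck() {}

    /** Returns true if the string contains an ASCII space (0x20). */
    static boolean containsSpace(String url) {
        return url.indexOf(' ') >= 0;
    }

    /** Rejects URLs with a space first, then applies a basic structural check. */
    static boolean isValid(String url) {
        if (url == null || url.isEmpty() || containsSpace(url)) {
            return false;
        }
        final URI uri;
        try {
            // java.net.URI also rejects characters such as '<' and '>'.
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        return scheme != null
                && SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                && uri.getHost() != null;
    }
}
```

With this sketch, UrlSpaceCheck.isValid(...) accepts the first autos.yahoo.com link from the list above and rejects the variant with a space before "expanded=fuel", as well as "http:/" and the links ending in "</div>".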
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers