[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515991 ]
Doğacan Güney edited comment on NUTCH-522 at 7/27/07 6:10 AM:
--------------------------------------------------------------

Btw, I thought about the validation a bit, and IMHO it is better to run the normalizers before UrlValidator, so the new order is: normalize, validate, filter.

Someone may write a normalizer that replaces spaces with %20, turning an otherwise invalid URL into a valid one. Such a normalizer should run before validation so that the URL passes (and IMO it should pass, since Nutch can fetch a URL containing %20).

I think your patch looks good, but I will wait a while in the hope of getting some comments on putting the normalizers before the validator.

> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check URLs in the Injector.
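The ordering proposed in the comment (normalize, validate, filter) can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: the class and method names are hypothetical, `java.net.URI` parsing stands in for the validator, and none of it is the actual Nutch Injector, UrlValidator, or the attached patch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical sketch of the proposed order; not Nutch code.
public class InjectPipelineSketch {

    // Hypothetical normalizer: trims the URL and percent-encodes spaces.
    static String normalize(String url) {
        return url.trim().replace(" ", "%20");
    }

    // Stand-in validator (an assumption, not Nutch's UrlValidator):
    // the URL must parse as a java.net.URI and have a scheme and a host.
    static boolean isValid(String url) {
        try {
            URI uri = new URI(url);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // e.g. a raw space in the path
        }
    }

    // Hypothetical filter: accept only http and https URLs.
    static boolean passesFilter(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    // Proposed order: normalize first, then validate, then filter.
    static Optional<String> process(String url) {
        String normalized = normalize(url);
        if (!isValid(normalized)) {
            return Optional.empty();
        }
        if (!passesFilter(normalized)) {
            return Optional.empty();
        }
        return Optional.of(normalized);
    }

    public static void main(String[] args) {
        String raw = "http://example.com/some page.html";
        // Validating the raw URL first would reject it because of the space.
        System.out.println("valid before normalizing: " + isValid(raw));
        // Normalizing first lets the same URL pass validation and filtering.
        System.out.println("after normalize, validate, filter: " + process(raw));
    }
}
```

With validation placed first, the example URL would be dropped before the normalizer had a chance to repair it; with normalization first, it survives as `http://example.com/some%20page.html`.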