[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517232 ]

Doğacan Güney commented on NUTCH-522:

> I tried with protocol-http and protocol-httpclient, and I got the same error when the url contained a space. I'm afraid it didn't change anything.

Actually, it is good news :). This means we can update the url pattern to exclude urls with spaces in them.

> I think you're right about the order, the normalizer should come first.

Btw, this is already what we do in ParseOutputFormat. Urls are normalized in Outlink's constructor, then validated and filtered in ParseOutputFormat. So, I am going to reverse the validator/normalizer order in your patch and commit it soon.

Use URLValidator in the Injector

Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0
Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now http://get.splunk.com/
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515991 ]

Doğacan Güney commented on NUTCH-522:

Btw, I thought about the validation stuff a bit and IMHO it is better to run normalizers before UrlValidator (so the new order is normalize, validate, filter). It is possible that someone writes a normalizer that replaces spaces with %20s (so the url becomes valid). If we have such a normalizer, we should run it before validation so that the url passes validation (and IMO, it should pass validation, since nutch can fetch a url with %20s).

I think your patch looks good, but I will wait a while to hopefully get some comments on putting normalizers before the validator.
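The ordering argument above can be illustrated with a small, self-contained Java sketch. It does not use the Nutch API: `normalize`, `isValid`, and `accept` are hypothetical stand-ins for a space-encoding normalizer, UrlValidator, and a url filter, and the class name is made up for this example.

```java
import java.util.Optional;

public class InjectOrderSketch {
    // Hypothetical normalizer: percent-encodes spaces.
    static String normalize(String url) {
        return url.replace(" ", "%20");
    }

    // Stand-in for a validator: rejects empty urls and urls containing whitespace.
    static boolean isValid(String url) {
        return !url.isEmpty() && url.chars().noneMatch(Character::isWhitespace);
    }

    // Stand-in for a url filter: accepts http and https only.
    static boolean accept(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    // Proposed order: normalize, validate, filter.
    static Optional<String> normalizeThenValidate(String url) {
        String normalized = normalize(url);
        if (!isValid(normalized) || !accept(normalized)) {
            return Optional.empty();
        }
        return Optional.of(normalized);
    }

    // Other order: validate, normalize, filter.
    static Optional<String> validateThenNormalize(String url) {
        if (!isValid(url)) {
            return Optional.empty();
        }
        String normalized = normalize(url);
        return accept(normalized) ? Optional.of(normalized) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "http://example.com/a b";
        System.out.println(normalizeThenValidate(url)); // Optional[http://example.com/a%20b]
        System.out.println(validateThenNormalize(url)); // Optional.empty
    }
}
```

With the validator first, the url is dropped before the normalizer ever gets a chance to repair it; with the normalizer first, the repaired url survives.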
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515992 ]

Doğacan Güney commented on NUTCH-522:

I forgot to ask: are you using protocol-http or protocol-httpclient? It is possible that httpclient does some sort of normalization before requesting a url, so (maybe) it can fetch a url like:

http://autos.yahoo.com/carfinder/?bodystyle=CPEfuel=Gasexpanded=bodystyle; expanded=fuel

or maybe it can't :).
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138 ]

Emmanuel Joke commented on NUTCH-522:

I tried with protocol-http and protocol-httpclient, and I got the same error when the url contained a space. I'm afraid it didn't change anything.

I think you're right about the order, the normalizer should come first.
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514144 ]

Doğacan Güney commented on NUTCH-522:

> Oops, my mistake. Please find an updated patch.

This patch looks good.

> For instance: http://lucene.apache.org/jira/browse.jsp?itemid=500 &sort=up
> A space between 500 and & has been accepted. Is it normal? I really want to exclude those kinds of URL.

UrlValidator is meant to eliminate anything nutch can't fetch. So, if the fetcher fails while trying to fetch that url, then UrlValidator should have eliminated it, and it is a bug.

> [...snip...] It includes an option to disallow FRAGMENTS. Why don't we have this version in nutch?

Because urlfilters can already do that, so I didn't want to duplicate functionality. UrlValidator eliminates invalid urls, then urlnormalizers and urlfilters decide what to do with the ones that remain. You can remove fragments or skip urls with fragments.

Use URLValidator in the Injector

Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0
Attachments: NUTCH-522.patch, NUTCH-522_v2.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector
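The two fragment-handling choices mentioned in this comment (remove the fragment, or skip the url) can be sketched in plain Java. This is only an illustration with hypothetical method names, not Nutch's urlnormalizer or urlfilter plugin code, and it uses simple string handling rather than a full URL parser.

```java
import java.util.Optional;

public class FragmentHandlingSketch {
    // Normalizer-style choice: keep the url, drop everything from '#' onward.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Filter-style choice: reject any url that carries a fragment.
    // An empty Optional means "skipped" in this sketch.
    static Optional<String> skipIfFragment(String url) {
        return url.indexOf('#') >= 0 ? Optional.empty() : Optional.of(url);
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.html#section2";
        System.out.println(stripFragment(url));  // http://example.com/page.html
        System.out.println(skipIfFragment(url)); // Optional.empty
    }
}
```

Either behaviour is a policy decision about a url that is already valid, which is why it sits with the normalizers and filters rather than in the validator.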
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153 ]

Emmanuel Joke commented on NUTCH-522:

Actually, I tried to fetch the url

http://autos.yahoo.com/carfinder/?bodystyle=CPE fuel=Gas expanded=bodystyle expanded=fuel

and it didn't work within Nutch. But if you remove the spaces:

http://autos.yahoo.com/carfinder/?bodystyle=CPEfuel=Gasexpanded=bodystyleexpanded=fuel

it works perfectly.

So, I guess we have to add a new check regarding spaces in URLs. Any idea?
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514157 ]

Doğacan Güney commented on NUTCH-522:

> So, I guess we have to add a new check regarding spaces in URLs. Any idea?

OK, it is a bug then. I would suggest that you add a main method to UrlValidator (like this one: http://www.ceng.metu.edu.tr/~e1345172/validator_main.patch), then debug UrlValidator to check why it accepts that url. Also, if commons-validator's UrlValidator filters that url, you can debug the original UrlValidator to see where it invalidates it.

My guess is that it may be related to LEGAL_ASCII_PATTERN. I couldn't get the original validator's LEGAL_ASCII_PATTERN to work with java.util.regex, so I wrote a new pattern, but I thought the new pattern was stricter than the old one.
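A rough sketch of the kind of debugging aid suggested here: a class with a `main` method that runs each command-line argument through a character check. The pattern below is an assumption made for illustration (printable ASCII from 0x21 to 0x7E, which excludes the space character); it is not the LEGAL_ASCII_PATTERN from Nutch or commons-validator, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class SpaceCheckSketch {
    // Assumed pattern for this sketch: one or more printable ASCII characters,
    // excluding space (0x20) and control characters.
    private static final Pattern NO_SPACE_ASCII = Pattern.compile("[\\x21-\\x7E]+");

    // Returns true only if every character of the url is in the allowed range.
    static boolean isLegalAscii(String url) {
        return url != null && NO_SPACE_ASCII.matcher(url).matches();
    }

    // Prints a verdict for each url given on the command line.
    public static void main(String[] args) {
        for (String url : args) {
            System.out.println(url + " -> " + (isLegalAscii(url) ? "valid" : "invalid"));
        }
    }
}
```

Running it with a url quoted on the command line, once with and once without an embedded space, shows quickly whether the character check is the part that lets the space through.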
[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513895 ]

Doğacan Güney commented on NUTCH-522:

I like the idea, but your patch seems to have a bug. Now the injector only injects a url if it is *not* valid.

Injector.java:75:
    if (!validator.isValid(url)) {

I think you should put a return there instead of moving the normalizing and filtering code into that branch.

Use URLValidator in the Injector

Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0
Attachments: NUTCH-522.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector
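The bug described in this comment and the suggested early return can be sketched as follows. This is not the Injector source: the validator is passed in as a plain `Predicate<String>`, collecting into a list stands in for the normalizing, filtering, and injecting code, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class GuardClauseSketch {
    // Buggy shape: the remaining work sits inside the "not valid" branch,
    // so only invalid urls are injected.
    static void injectBuggy(String url, Predicate<String> isValid, List<String> out) {
        if (!isValid.test(url)) {
            out.add(url); // stands in for normalizing, filtering, and injecting
        }
    }

    // Suggested shape: return early on invalid urls, then carry on as before.
    static void injectFixed(String url, Predicate<String> isValid, List<String> out) {
        if (!isValid.test(url)) {
            return;
        }
        out.add(url); // stands in for normalizing, filtering, and injecting
    }

    public static void main(String[] args) {
        Predicate<String> noSpaces = u -> !u.contains(" ");
        List<String> buggy = new ArrayList<>();
        List<String> fixed = new ArrayList<>();
        for (String url : new String[] {"http://example.com/ok", "http://example.com/not ok"}) {
            injectBuggy(url, noSpaces, buggy);
            injectFixed(url, noSpaces, fixed);
        }
        System.out.println(buggy); // [http://example.com/not ok]
        System.out.println(fixed); // [http://example.com/ok]
    }
}
```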