[jira] Commented: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923 ]

Ken Krugler commented on NUTCH-706:
---

Two comments about this:

1. From my experiences with Nutch & Bixo, I think that URL normalization ultimately needs to be more structured - i.e. first break the URL into pieces, then apply rules against the pieces. Trying to craft regular expressions to handle target cases leads to big, hairy, hard-to-understand strings.

2. URL normalization is something that makes a lot of sense for crawler-commons. If somebody from the Nutch side wants to define a target API, I could look at porting existing Bixo code to crawler-commons.

Url regex normalizer

Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

Hey,

I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session ids:

<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

This pattern also transforms a url such as "newsId=2000484784794&newsLang=en" into "new&newsLang=en" (since it matches 'sId' in 'newsId'), which is incorrect, and hence the page does not get fetched. This expression needs to be changed to prevent this.

Thanks,
Meghna

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
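Ken Krugler's first point (break the URL into pieces, then apply rules against the pieces) could be sketched roughly as below. This is only an illustration in Python using the standard library's urllib.parse; it is not Bixo, Nutch, or crawler-commons code. The names strip_session_ids, _clean_segment and SESSION_PARAMS are hypothetical, and the parameter names are simply derived from the prefixes and keys in the regex quoted in the issue.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule set: every prefix/key combination from the regex in the
# issue (e.g. "sid", "jsessionid", "phpsessid"), compared as WHOLE names.
SESSION_PARAMS = {
    prefix + key
    for prefix in ("", "l", "j", "bv_")
    for key in ("sid", "phpsessid", "sessionid")
}


def _clean_segment(segment: str) -> str:
    """Drop ';name=value' path parameters whose name is a session key."""
    name, *params = segment.split(";")
    kept = [p for p in params if p.split("=", 1)[0].lower() not in SESSION_PARAMS]
    return ";".join([name, *kept])


def strip_session_ids(url: str) -> str:
    """Split the URL into pieces, filter each piece, then reassemble it."""
    parts = urlsplit(url)
    path = "/".join(_clean_segment(s) for s in parts.path.split("/"))
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k.lower() not in SESSION_PARAMS]
    # Note: urlencode re-encodes values, so unusual escaping in the input
    # may come back in a different (equivalent) form.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))


if __name__ == "__main__":
    # "newsId" is compared as a whole name, so it is left alone.
    print(strip_session_ids("http://example.com/news?newsId=2000484784794&newsLang=en"))
    print(strip_session_ids("http://example.com/index.jsp;jsessionid=ABC123?x=1&sid=9#top"))
```

Because names are compared whole rather than matched as substrings, the 'sId' inside 'newsId' never comes into play, which is the failure reported in this issue.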
[jira] Commented: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689385#action_12689385 ]

Dmitry Lihachev commented on NUTCH-706:
---

I think this must be changed to

{code:xml}
<regex>
  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|phpsessid|sessionid|conversationid|sess_id)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$1$5</substitution>
</regex>
{code}

Url regex normalizer

Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor
Fix For: 1.1
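To get a feel for how the rule proposed by Dmitry Lihachev behaves, here is a small Python approximation. Several things are assumptions on my part: the XML entity &amp; is decoded to a plain &, Java's inline (?i) is rewritten as a scoped (?i:...) group (Python does not accept an inline flag in the middle of a pattern), and the normalizer is assumed to replace every match in the URL. The function name normalize_dmitry and the example.com URLs are made up for the test.

```python
import re

# Approximate Python translation of the proposed <pattern>; the capturing
# groups are kept in the same order so that $1 and $5 map to \1 and \5.
PATTERN = re.compile(
    r"(\?|&)"
    r"([;_]?((?i:l|j|bv_|ps_))?"
    r"((?i:s|sid|phpsessid|sessionid|conversationid|sess_id))=.*?)"
    r"(\?|&|#|$)"
)
SUBSTITUTION = r"\1\5"  # corresponds to <substitution>$1$5</substitution>


def normalize_dmitry(url: str) -> str:
    """Apply the proposed rule to every match in the URL (assumed behaviour)."""
    return PATTERN.sub(SUBSTITUTION, url)


if __name__ == "__main__":
    # The URL from the issue is no longer damaged.
    print(normalize_dmitry("newsId=2000484784794&newsLang=en"))
    # A real session id is removed, but both delimiters are kept ("&&").
    print(normalize_dmitry("http://example.com/page?a=1&sid=abc&b=2"))
    # The bare "s" key also removes an ordinary "s=" parameter.
    print(normalize_dmitry("http://example.com/search?s=nutch&page=2"))
```

Under these assumptions the newsId case is fixed, but two side effects show up: keeping both $1 and $5 leaves doubled delimiters such as "&&" or "?&", and the bare "s" alternative also strips a plain "s=" query parameter. A path parameter such as ";jsessionid=..." no longer matches either, since the rule now requires a leading "?" or "&".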
[jira] Commented: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677460#action_12677460 ]

Meghna Kukreja commented on NUTCH-706:
--

The pattern should be changed to:

<pattern>([;_\?&amp;]((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

Can someone please verify this? I am not very good with regular expressions :)

Url regex normalizer

Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor
Fix For: 1.0.0
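As a rough way to verify the pattern Meghna proposes, the sketch below compares it with the original rule in Python. It rests on a few assumptions: &amp; is decoded to &, Java's inline (?i) is rewritten as a scoped (?i:...) group, the substitution is $4 for both rules (the comment does not mention changing it), and every match in the URL is replaced. The names ORIGINAL, PROPOSED and normalize, and the example.com URLs, are made up for illustration.

```python
import re

# Original rule from regex-normalize.xml as quoted in the issue.
ORIGINAL = re.compile(
    r"([;_]?((?i:l|j|bv_))?((?i:sid|phpsessid|sessionid))=.*?)(\?|&|#|$)"
)
# Proposed rule: a delimiter (; _ ? &) is now REQUIRED before the key.
PROPOSED = re.compile(
    r"([;_\?&]((?i:l|j|bv_))?((?i:sid|phpsessid|sessionid))=.*?)(\?|&|#|$)"
)


def normalize(pattern: re.Pattern, url: str) -> str:
    """Replace each match with group 4, i.e. the assumed $4 substitution."""
    return pattern.sub(r"\4", url)


if __name__ == "__main__":
    url = "newsId=2000484784794&newsLang=en"
    print(normalize(ORIGINAL, url))   # new&newsLang=en  (the reported bug)
    print(normalize(PROPOSED, url))   # newsId=2000484784794&newsLang=en

    # Session ids that follow a delimiter are still removed.
    print(normalize(PROPOSED, "http://example.com/page?a=1&sid=abc&b=2"))
    print(normalize(PROPOSED, "http://example.com/index.jsp;jsessionid=ABC?x=1"))

    # Caveat: a session id that is the FIRST query parameter takes the "?"
    # with it, because the leading delimiter is part of the removed match.
    print(normalize(PROPOSED, "http://example.com/page?sid=abc&b=2"))
```

Under these assumptions the proposed pattern does fix the newsId case and still strips ordinary session ids. The one behaviour worth checking before adopting it is the last example, where the "?" is consumed and the result is "http://example.com/page&b=2".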