[ https://issues.apache.org/jira/browse/NUTCH-255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-255.
-----------------------------------

    Resolution: Duplicate

> Regular Expression for RegexUrlNormalizer to remove jsessionid
> --------------------------------------------------------------
>
>                 Key: NUTCH-255
>                 URL: https://issues.apache.org/jira/browse/NUTCH-255
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: Windows XP Media Center 2005, 2 Gigs RAM, 3.0 Ghz Pentium 4 Hyperthreaded, Eclipse 3.2.0
>            Reporter: Dennis Kubes
>            Priority: Trivial
>         Attachments: urlnormalize_jessionid.patch
>
>
> By default, the crawl URL filter rejects URLs that contain certain special
> characters. One example is URLs that carry a jsessionid, such as:
> http://www.somesite.com;jsessionid=A8D7D812B5EFD3099F099A760F779E3B?query=string
> We want to remove the jsessionid and keep everything else, so that the URL
> looks like this:
> http://www.somesite.com?query=string
> Below is a regular expression for the regex-normalize.xml file used by the
> RegexUrlNormalizer. It successfully removes jsessionid strings while leaving
> the hostname and query string intact. I have also attached a patch for the
> regex-normalize.xml.template file that adds the following expression.
> <regex>
>   <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
>   <substitution>$1$3</substitution>
> </regex>
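
The pattern and substitution quoted above can be checked outside Nutch. The following is a minimal sketch that applies them with plain java.util.regex. It does not use Nutch's RegexUrlNormalizer, and the class name JsessionidNormalizer and its normalize method are hypothetical names chosen for illustration. It assumes the rule is applied as a replace-all over the whole URL string, which may differ from how Nutch applies it internally.

```java
import java.util.regex.Pattern;

public class JsessionidNormalizer {
    // Pattern and substitution copied from the <regex> entry in the issue.
    // Group 1: everything before the session id; group 2: the session id
    // itself (exactly 32 alphanumeric characters); group 3: the remainder.
    private static final Pattern JSESSIONID =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");
    private static final String SUBSTITUTION = "$1$3";

    // Returns the URL with the jsessionid segment dropped, or the URL
    // unchanged when the pattern does not match.
    public static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String url = "http://www.somesite.com;jsessionid="
            + "A8D7D812B5EFD3099F099A760F779E3B?query=string";
        System.out.println(normalize(url));
    }
}
```

Because the pattern requires exactly 32 alphanumeric characters, a shorter session id is left untouched by this rule.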