[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473825#comment-13473825 ]
Hudson commented on NUTCH-706:
------------------------------
Integrated in Nutch-nutchgora #375 (See [https://builds.apache.org/job/Nutch-nutchgora/375/])
NUTCH-706 (applied correct patch) (Revision 1396822)
NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Revision 1396795)
Result = SUCCESS
snagel :
Files :
* /nutch/branches/2.x/conf/regex-normalize.xml.template
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
snagel :
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/regex-normalize.xml.template
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
* /nutch/branches/2.x/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
> Url regex normalizer: default pattern for session id removal not to match "newsId"
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Meghna Kukreja
> Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
> This pattern also transforms a URL fragment such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'). The result is incorrect, so the page does
> not get fetched. The expression needs to be changed to prevent this.
> Thanks,
> Meghna
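
The behaviour described in the report can be reproduced outside Nutch with plain java.util.regex. The following is only a minimal sketch under stated assumptions: the class name SessionIdPatternSketch, the helper strip(), and the GUARDED pattern are hypothetical and are not taken from the attached patches. It also assumes the rule's substitution keeps only the trailing delimiter (group 4), which is consistent with the "&new&newsLang=en" result quoted above. GUARDED adds a negative lookbehind so that the parameter name may not begin in the middle of a longer alphanumeric name.

```java
import java.util.regex.Pattern;

final class SessionIdPatternSketch {
    // Pattern exactly as quoted in the report.
    static final Pattern REPORTED = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical variant (an assumption, not the committed fix): the
    // lookbehind rejects a match whose parameter name is directly preceded
    // by a letter or digit, as the "sId" inside "newsId" is.
    static final Pattern GUARDED = Pattern.compile(
        "([;_]?(?<![A-Za-z0-9])((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Replace every match with its trailing delimiter (group 4); this
    // substitution is assumed, see the note above the code.
    static String strip(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String fragment = "&newsId=2000484784794&newsLang=en";
        String withSession = "http://example.com/a;jsessionid=ABC123?x=1";

        System.out.println(strip(REPORTED, fragment));    // &new&newsLang=en
        System.out.println(strip(GUARDED, fragment));     // &newsId=2000484784794&newsLang=en
        System.out.println(strip(GUARDED, withSession));  // http://example.com/a?x=1
    }
}
```

The example.com URL is made up for illustration. Whether a lookbehind or an explicit list of allowed preceding delimiters is the better guard depends on which parameter names should count as session ids, so the variant above should be checked against the normalizer's own test file before use.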