Re: Url regex normalizer

Andrzej Bialecki Fri, 27 Feb 2009 09:10:48 -0800

Meghna Kukreja wrote:

Hey,


I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:

<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>.

This pattern also rewrites URLs that only happen to contain those letters.
For example, "&newsId=2000484784794&newsLang=en" becomes "&new&newsLang=en"
(because it matches 'sId' inside 'newsId'), which is incorrect, so the page
does not get fetched. The expression needs to be changed to prevent this.
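
The behaviour described above can be reproduced outside Nutch. The following is a minimal sketch in Python, not the Nutch code itself. It makes several assumptions: the pattern is adapted to Python syntax ("&amp;" unescaped to "&", Java's inline (?i) written as scoped (?i:...) groups), the rule's substitution is assumed to keep only the trailing delimiter (group 4), which is consistent with the output reported above, and CANDIDATE is a hypothetical variant, not an actual patch. It adds a lookbehind so the session parameter name may not start in the middle of a longer word.

```python
import re

# Python adaptation of the pattern quoted from regex-normalize.xml.
# Assumption: "&amp;" is the XML-escaped form of "&"; Java's inline (?i)
# is expressed here with scoped (?i:...) groups.
ORIGINAL = re.compile(
    r"([;_]?((?i:l|j|bv_))?((?i:sid|phpsessid|sessionid))=.*?)(\?|&|#|$)"
)

# Hypothetical variant (not an actual patch): after the optional ';' or '_',
# the preceding character must not be a letter or digit, so 'sId' inside
# 'newsId' no longer qualifies.
CANDIDATE = re.compile(
    r"([;_]?(?<![A-Za-z0-9])((?i:l|j|bv_))?((?i:sid|phpsessid|sessionid))=.*?)(\?|&|#|$)"
)


def normalize(pattern: re.Pattern, url: str) -> str:
    # Assumed substitution: keep only the trailing delimiter (group 4).
    return pattern.sub(r"\4", url)


if __name__ == "__main__":
    news = "&newsId=2000484784794&newsLang=en"
    session = "http://example.com/index.jsp;jsessionid=ABC123?x=1"  # made-up URL

    print(normalize(ORIGINAL, news))      # the reported bug: "&new&newsLang=en"
    print(normalize(CANDIDATE, news))     # left unchanged
    print(normalize(CANDIDATE, session))  # session id still stripped
```

Any real fix would need to be checked against Java's regex engine and the other rules in regex-normalize.xml before being proposed as a patch.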

I agree. Can you please create a JIRA issue with this information, and propose a patch to fix this?



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
