Meghna Kukreja wrote:
Hey,
I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:
<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>.
This pattern also rewrites a URL fragment such as
"&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (because it
matches 'sId' inside 'newsId'). The rewritten URL is incorrect, so the
page does not get fetched. The expression needs to be changed to prevent this.
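
The behaviour described above can be reproduced outside Nutch with plain java.util.regex. The following is only a minimal sketch: the class name, the "$4" substitution (assumed because it is consistent with the output reported above) and the tightened pattern are illustrative assumptions, not the contents of regex-normalize.xml or an agreed patch. The tightened variant adds a negative lookbehind so that the parameter name may not start in the middle of a longer alphanumeric name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class SessionIdRegexDemo {

    // Pattern as quoted in the message above.
    static final Pattern ORIGINAL = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumed tightening (illustrative only): after the optional ';' or '_',
    // the character before the parameter name must not be a letter or digit,
    // so the 'sId' inside 'newsId' no longer qualifies.
    static final Pattern TIGHTENED = Pattern.compile(
        "([;_]?(?<![A-Za-z0-9])((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Replace each match with its trailing delimiter (group 4).
    // The "$4" substitution is an assumption that reproduces the reported output.
    static String strip(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String news = "&newsId=2000484784794&newsLang=en";
        String session = "http://example.com/page;jsessionid=ABC123?x=1";

        System.out.println(strip(ORIGINAL, news));     // the problem case
        System.out.println(strip(TIGHTENED, news));    // left unchanged
        System.out.println(strip(TIGHTENED, session)); // session id still removed
    }
}
```

The lookbehind is placed after the optional [;_] so that a leading ';' or '_' is still consumed together with the session id. Other approaches, such as anchoring on a word boundary, would need the same kind of check against real URLs before being proposed as a patch.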
I agree. Can you please create a JIRA issue with this information, and
propose a patch to fix this?
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com