Just in case anyone is looking for rules to remove session IDs:
the following rules strip them from URLs so that duplicate pages do not get indexed.
They need to be added to regex-normalize.xml.
These rules cover about 80% of the cases I've seen on the net.
If anyone wants to send me URLs from other implementations, I'd be happy to test
them and create the rules.
Can someone add them to the sample file in CVS?
Caution: your email client may add line breaks; please remove them when you paste.
-------------------------------------
<!-- Rule to remove the "rand" key that is sometimes there -->
<regex>
<pattern>(.*)rand=[a-zA-Z0-9]*(.*)</pattern>
<substitution>$1$2</substitution>
</regex>
<!-- Rule to remove anchors (the #fragment part of the URL) -->
<regex>
<pattern>(.*)#.*</pattern>
<substitution>$1</substitution>
</regex>
<!-- Rule to remove Amazon sessions -->
<regex>
<pattern>(.*\.amazon\..*)[0-9]{3}\-[0-9]{7}\-[0-9]{7}(.*)</pattern>
<substitution>$1$2</substitution>
</regex>
<!-- Rules to remove other session IDs that are 32 chars in length -->
<!-- (ampersands are XML-escaped so that the file stays well-formed) -->
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)[a-zA-Z0-9]+=[a-zA-Z0-9]{32}$</pattern>
<substitution></substitution>
</regex>
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)[a-zA-Z0-9]+=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
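To see what these rules do in practice, here is a rough sketch that runs them in order over a few made-up URLs. It uses plain java.util.regex rather than Nutch's own normalizer, so treat it as an approximation of the real behavior. The class name SessionIdRules, the normalize helper and the sample URLs are hypothetical, chosen only for illustration. The patterns are written the way the regex engine sees them, that is, with any XML entities already decoded.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: applies the rules above, in order, to a URL.
final class SessionIdRules {

    /** One rule: a compiled pattern plus its substitution string. */
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    // Same order as in the XML snippet.
    private static final Rule[] RULES = {
        // "rand" key
        new Rule("(.*)rand=[a-zA-Z0-9]*(.*)", "$1$2"),
        // anchors
        new Rule("(.*)#.*", "$1"),
        // Amazon sessions
        new Rule("(.*\\.amazon\\..*)[0-9]{3}\\-[0-9]{7}\\-[0-9]{7}(.*)", "$1$2"),
        // 32-char session value at the end of the URL
        new Rule("(\\?|\\&|\\&amp;)[a-zA-Z0-9]+=[a-zA-Z0-9]{32}$", ""),
        // 32-char session value followed by more parameters
        new Rule("(\\?|\\&|\\&amp;)[a-zA-Z0-9]+=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)", "$1$3"),
    };

    /** Runs every rule once, each on the output of the previous one. */
    static String normalize(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample URLs, one per rule.
        String[] samples = {
            "http://example.com/page.html#top",
            "http://example.com/view.php?id=7&sid=0123456789abcdef0123456789abcdef",
            "http://example.com/view.php?sid=0123456789abcdef0123456789abcdef&id=7",
            "http://www.amazon.com/exec/obidos/ASIN/0000000000/102-1234567-7654321",
            "http://example.com/list?rand=abc123&page=2",
        };
        for (String url : samples) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

With these inputs the anchor, the Amazon session and both 32-char session values are stripped cleanly. The "rand" rule removes only the key and its value, so the last sample comes out as http://example.com/list?&page=2 with the separator left behind. Since that pattern is not anchored to a parameter boundary, it would also cut into a parameter such as brand=acme, which may be worth tightening before the rule goes into the sample file.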
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers