Gal Nitzan wrote:
Jon Shoberg wrote:
I'm getting a ton of duplicate content from a forum with session IDs.
It's a phpBB forum, which puts a sid parameter in the URL's query string.
What have other people done to crawl forums and minimize duplicates?
These are ones that dedup is not catching.
Can anyone explain how regex-normalize.xml is used? I'm about to
open the source and see...
These URLs look like the following and appear to have the same content to the user:
http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
Below is my regex normalize file:
<?xml version="1.0"?>
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
done on URLs. The regex engine that is used is Perl5 compatible.
The rules are applied to URLs in the order they occur in this
file. -->
<!-- WATCH OUT: an xml parser reads this file, so ampersands must be
expanded to &amp; -->
<!-- The following rules show how to strip out session IDs
that are 32 characters long and have the parameter
name of PHPSESSID. Order does matter! -->
<regex-normalize>
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
<substitution></substitution>
</regex>
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
<substitution></substitution>
</regex>
<regex>
<pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
</regex-normalize>
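One way to sanity-check the rules outside the crawler is to replay them against the sample URLs. The following is only a rough sketch: it uses Python's re module as a stand-in for the Perl5-compatible engine, writes the patterns as they would look after XML parsing (so the alternation matches `?`, `&`, or a literal `&amp;`), and the `normalize` helper and the viewtopic.php URL are made up for illustration.

```python
import re

# The four rules from the file, in file order, as (pattern, substitution).
# Python writes backreferences as \1 and \3 instead of $1 and $3.
RULES = [
    (r"(\?|&|&amp;)PHPSESSID=[a-zA-Z0-9]{32}$", ""),
    (r"(\?|&|&amp;)PHPSESSID=[a-zA-Z0-9]{32}(&|&amp;)(.*)", r"\1\3"),
    (r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}$", ""),
    (r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}(&|&amp;)(.*)", r"\1\3"),
]


def normalize(url):
    """Apply every rule in order (hypothetical helper, not crawler code)."""
    for pattern, substitution in RULES:
        url = re.sub(pattern, substitution, url)
    return url


# Two of the sample URLs from this message
print(normalize("http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"))
# -> http://domain/forum/faq.php
print(normalize("http://domain/forum/login.php?redirect=profile.php"
                "&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea"))
# -> http://domain/forum/login.php?redirect=profile.php&mode=editprofile

# Made-up URL with the sid in the middle of the query string
print(normalize("http://domain/forum/viewtopic.php"
                "?sid=0c6f7ba099d901c634379994cdacc611&t=5"))
# -> http://domain/forum/viewtopic.php?t=5
```

If the patterns behave in the crawler the way they do in this sketch, the rules themselves are probably not the cause of the duplicates, and it may be worth checking whether the regex normalizer is actually enabled for the crawl.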
Hi Jon,
I'm not sure the normalize file is the right place for this. I use
regex-urlfilter.txt with the following rule:
-(session|Session|SESS|sid)
I know it might leave a url like obsession.url out, but it is better
than your fetcher running in circles :-)
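To make the trade-off concrete, here is a small sketch of how an unanchored exclude pattern like this behaves. It assumes the filter rejects a URL whenever the pattern is found anywhere in it; the `is_excluded` helper and the inside.html URL are invented for the example.

```python
import re

# The rule without its leading "-", which only marks it as an exclude rule
EXCLUDE = re.compile(r"(session|Session|SESS|sid)")


def is_excluded(url):
    """True if the pattern occurs anywhere in the URL (hypothetical helper)."""
    return EXCLUDE.search(url) is not None


# A session URL is dropped, as intended
print(is_excluded("http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"))  # True

# Unrelated URLs that merely contain "session" or "sid" are dropped too
print(is_excluded("http://obsession.url/"))       # True
print(is_excluded("http://domain/inside.html"))   # True

# A URL with none of the substrings is kept
print(is_excluded("http://domain/forum/index.php"))  # False
```

So the rule trades some coverage for safety: pages carrying a session ID are never fetched at all, rather than being fetched once under a cleaned-up URL.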
Hope it helps,
Gal
Yes,
Better than circles, but I'm looking to refine the config to handle
these URLs, not just avoid them.
-j