Hi Jon,

I think you can rewrite each URL to drop the "sid" parameter before putting it into the fetchlist.
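A rough sketch of what I mean, in Python rather than anything Nutch-specific. The helper name strip_session_params and the SESSION_PARAMS set are made up for illustration and are not part of Nutch; the idea is just to drop the session parameters by name and keep everything else in the query string:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of session parameter names (an assumption; adjust per site).
SESSION_PARAMS = {"sid", "PHPSESSID"}


def strip_session_params(url, names=SESSION_PARAMS):
    """Return url with the named query parameters removed.

    Illustration only, not a Nutch API. Note that urlencode re-encodes
    the remaining parameters, so unusual escaping may be rewritten.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # An empty query makes urlunsplit omit the "?" entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
    "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611",
    "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
    "&sid=aa4d00c2717784c0c8f5a8182ce772ea",
]

# Collapse duplicates after normalizing, before anything reaches the fetchlist.
unique_urls = sorted({strip_session_params(u) for u in urls})
```

With your three example URLs, the two faq.php links collapse into one entry and the login.php link keeps its redirect and mode parameters.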
Regards
/Jack

On 9/28/05, Jon Shoberg <[EMAIL PROTECTED]> wrote:
> Gal Nitzan wrote:
> > Jon Shoberg wrote:
> >
> >> I'm getting a ton of duplicate content from a forum with session IDs.
> >> It's a phpBB which uses a question mark in the URL and sid.
> >>
> >> What have other people done to crawl forums and minimize duplicates?
> >> These are ones that dedup is not catching.
> >>
> >> Is anyone able to explain how regex-normalize.xml is used? I'm about to
> >> open the source and see...
> >>
> >> These URLs look like this and appear to have the same content to the user:
> >>
> >> http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
> >> http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
> >> http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
> >>
> >>
> >> Below is my regex normalize file:
> >>
> >> <?xml version="1.0"?>
> >> <!-- This is the configuration file for the RegexUrlNormalize Class.
> >>      This is intended so that users can specify substitutions to be
> >>      done on URLs. The regex engine that is used is Perl5 compatible.
> >>      The rules are applied to URLs in the order they occur in this
> >>      file. -->
> >>
> >> <!-- WATCH OUT: an xml parser reads this file and ampersands must be
> >>      expanded to &amp; -->
> >>
> >> <!-- The following rules show how to strip out session IDs
> >>      that are 32 characters long and have the parameter
> >>      name of PHPSESSID. Order does matter!
> >> -->
> >> <regex-normalize>
> >>   <regex>
> >>     <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
> >>     <substitution></substitution>
> >>   </regex>
> >>   <regex>
> >>     <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>     <substitution>$1$3</substitution>
> >>   </regex>
> >>   <regex>
> >>     <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
> >>     <substitution></substitution>
> >>   </regex>
> >>   <regex>
> >>     <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>     <substitution>$1$3</substitution>
> >>   </regex>
> >> </regex-normalize>
> >>
> >> .
> >>
> >
> > Hi Jon,
> >
> > I'm not sure the normalize file is the correct place. I use
> > regex-urlfilter.txt with the following:
> >
> > -(session|Session|SESS|sid)
> >
> > I know it might leave out a URL like obsession.url, but it is better
> > than your fetcher running in circles :-)
> >
> > Hope it helps,
> >
> > Gal
>
> Yes,
>
> Better than circles, but I'm looking to refine the config to allow
> for these URLs, not just avoid them.
>
> -j
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
