Hi Jon

I think you can normalize the URL by stripping the "sid" parameter
before putting it into the fetchlist.
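
To make the idea concrete, here is a rough sketch in Python of what stripping "sid" would do to the URLs from your mail. It is not Nutch code: it only re-expresses the two sid rules from your quoted regex-normalize.xml with the XML entities decoded, and strip_sid, viewtopic.php and the t parameter are names I made up for illustration.

```python
import re

# Python equivalents of the two "sid" rules in the quoted regex-normalize.xml,
# with XML entities decoded (&amp; -> &, &amp;amp; -> &amp;).
# Each entry is (pattern, replacement); they are applied in file order.
SID_RULES = [
    # sid is the last parameter: drop the separator and the parameter.
    (re.compile(r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}$"), ""),
    # sid is followed by more parameters: keep the leading separator ($1)
    # and everything after the trailing separator ($3).
    (re.compile(r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}(&|&amp;)(.*)"), r"\1\3"),
]


def strip_sid(url: str) -> str:
    """Hypothetical helper: apply each rule in order to the URL."""
    for pattern, replacement in SID_RULES:
        url = pattern.sub(replacement, url)
    return url


urls = [
    "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
    "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611",
    "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
    "&sid=aa4d00c2717784c0c8f5a8182ce772ea",
    # Made-up URL with sid in the middle of the query string.
    "http://domain/forum/viewtopic.php?sid=aa4d00c2717784c0c8f5a8182ce772ea&t=5",
]

for u in urls:
    print(strip_sid(u))

# The two faq.php URLs collapse to a single entry once sid is gone.
unique = {strip_sid(u) for u in urls}
print(len(unique))
```

If the patterns behave the same way inside Nutch, the two faq.php URLs end up identical, so only one of them should reach the fetchlist.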

Regards
/Jack

On 9/28/05, Jon Shoberg <[EMAIL PROTECTED]> wrote:
> Gal Nitzan wrote:
> > Jon Shoberg wrote:
> >
> >> I'm getting a ton of duplicate content from a forum with session IDs.
> >> It's a phpBB forum, which puts a sid parameter in the URL query string.
> >>
> >> What have other people done to crawl forums and minimize duplicates?
> >> These are ones that dedup is not catching.
> >>
> >> Can anyone explain how regex-normalize.xml is used? I'm about to
> >> open the source and see...
> >>
> >> The URLs look like this, and they appear to have the same content to the user:
> >>
> >> http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
> >> http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
> >> http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
> >>
> >>
> >> Below is my regex normalize file:
> >>
> >> <?xml version="1.0"?>
> >> <!-- This is the configuration file for the RegexUrlNormalize Class.
> >>      This is intended so that users can specify substitutions to be
> >>      done on URLs. The regex engine that is used is Perl5 compatible.
> >>      The rules are applied to URLs in the order they occur in this
> >> file.  -->
> >>
> >> <!-- WATCH OUT: an xml parser reads this file and ampersands must be
> >>      expanded to &amp; -->
> >>
> >> <!-- The following rules show how to strip out session IDs
> >>      that are 32 characters long and have the parameter
> >>      name of PHPSESSID. Order does matter!  -->
> >> <regex-normalize>
> >> <regex>
> >>   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
> >>   <substitution></substitution>
> >> </regex>
> >> <regex>
> >>   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>   <substitution>$1$3</substitution>
> >> </regex>
> >> <regex>
> >>   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
> >>   <substitution></substitution>
> >> </regex>
> >> <regex>
> >>   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>   <substitution>$1$3</substitution>
> >> </regex>
> >> </regex-normalize>
> >>
> >> .
> >>
> >
> > Hi Jon,
> >
> > I'm not sure the normalize file is the right place. I use
> > regex-urlfilter.txt with the following rule:
> >
> > -(session|Session|SESS|sid)
> >
> > I know it might also leave out a URL like obsession.url, but that is
> > better than your fetcher running in circles :-)
> >
> > Hope it helps,
> >
> > Gal
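
A quick way to see how broad Gal's rule is: a line starting with "-" in the URL filter file excludes any URL in which the regex matches anywhere. The Python sketch below mimics only that one rule, not Nutch's filter class; is_excluded and the residents.html URL are made up for illustration.

```python
import re

# The regex from Gal's rule "-(session|Session|SESS|sid)".
# The leading "-" means "exclude URLs where this matches anywhere".
EXCLUDE = re.compile(r"(session|Session|SESS|sid)")


def is_excluded(url: str) -> bool:
    """Hypothetical helper: True if the exclusion rule would drop the URL."""
    return EXCLUDE.search(url) is not None


candidates = [
    "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
    "http://domain/forum/faq.php",
    "http://obsession.url/",
    # Made-up URL: "residents" happens to contain the substring "sid".
    "http://domain/residents.html",
]

for url in candidates:
    print(url, "excluded" if is_excluded(url) else "kept")
```

So the rule skips every sid URL outright rather than collapsing them into one, and it also drops unrelated URLs that merely contain one of the substrings.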
>
> Yes,
>
>    Better than circles, but I'm looking to refine the config to handle
> these URLs, not just avoid them.
>
> -j
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
