Re: SessionIDs and forums are killing my fetch

Jack Tang Wed, 28 Sep 2005 04:24:02 -0700

Hi Jon

Please see the getOutlinks() method in the DOMContentUtils class
of the parse-html plugin for details.


You can revise the URLs before this call:

outlinks.add(new Outlink(url.toString(), linkText
                                    .toString().trim()));
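
A minimal sketch of what that revision could look like, assuming the
session ID is always a 32-character alphanumeric "sid" parameter as in
the URLs quoted below. SidStripper and stripSessionId are hypothetical
names, not part of Nutch, and the viewtopic.php URL is a made-up example;
the idea would be to pass url.toString() through the helper before
constructing the Outlink.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: strips a phpBB-style "sid" parameter.
public class SidStripper {
    // Assumption: the session ID is exactly 32 alphanumeric characters.
    // Case 1: sid is the last parameter, e.g. "...faq.php?sid=<32 chars>"
    private static final Pattern SID_AT_END =
        Pattern.compile("[?&]sid=[a-zA-Z0-9]{32}$");
    // Case 2: sid is followed by more parameters, e.g. "...?sid=<32 chars>&t=5"
    private static final Pattern SID_IN_MIDDLE =
        Pattern.compile("([?&])sid=[a-zA-Z0-9]{32}&");

    public static String stripSessionId(String url) {
        // Drop a trailing sid together with its leading '?' or '&'.
        String out = SID_AT_END.matcher(url).replaceFirst("");
        // Drop a sid in the middle but keep the separator that preceded it.
        return SID_IN_MIDDLE.matcher(out).replaceFirst("$1");
    }

    public static void main(String[] args) {
        System.out.println(stripSessionId(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"));
        System.out.println(stripSessionId(
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
            + "&sid=aa4d00c2717784c0c8f5a8182ce772ea"));
        System.out.println(stripSessionId(
            "http://domain/forum/viewtopic.php?sid=aa4d00c2717784c0c8f5a8182ce772ea&t=5"));
    }
}
```

With this, the two faq.php URLs from the original post would both collapse
to the same string, so they would no longer be fetched as separate pages.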

Hope it helps

Regards
/Jack

On 9/28/05, Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Jack,
>
> How can you discard a URL from the fetchlist?
>
> Regards,
> Gal
>
> Jack Tang wrote:
> > Hi Jon
> >
> > I think you can revise the URL by discarding the "sid" param before
> > putting it into the fetchlist.
> >
> > Regards
> > /Jack
> >
> > On 9/28/05, Jon Shoberg <[EMAIL PROTECTED]> wrote:
> >
> >> Gal Nitzan wrote:
> >>
> >>> Jon Shoberg wrote:
> >>>
> >>>
> >>>> I'm getting a ton of duplicate content from a forum with sessionIDs.
> >>>> It's a phpBB forum, which puts a sid parameter in the URL's query string.
> >>>>
> >>>> What have other people done to crawl forums and minimize duplicates?
> >>>> These are ones that dedup is not catching.
> >>>>
> >>>> Is anyone able to explain how regex-normalize.xml is used? I'm about to
> >>>> open the source and see...
> >>>>
> >>>> The URLs look like these, and appear to have the same content to the user:
> >>>>
> >>>> http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
> >>>> http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
> >>>> http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
> >>>>
> >>>>
> >>>> Below is my regex normalize file:
> >>>>
> >>>> <?xml version="1.0"?>
> >>>> <!-- This is the configuration file for the RegexUrlNormalize Class.
> >>>>      This is intended so that users can specify substitutions to be
> >>>>      done on URLs. The regex engine that is used is Perl5 compatible.
> >>>>      The rules are applied to URLs in the order they occur in this
> >>>>      file.  -->
> >>>>
> >>>> <!-- WATCH OUT: an XML parser reads this file, so ampersands must be
> >>>>      expanded to &amp; -->
> >>>>
> >>>> <!-- The following rules show how to strip out session IDs
> >>>>      that are 32 characters long and have the parameter
> >>>>      name of PHPSESSID. Order does matter!  -->
> >>>> <regex-normalize>
> >>>> <regex>
> >>>>   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
> >>>>   <substitution></substitution>
> >>>> </regex>
> >>>> <regex>
> >>>>   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>>>   <substitution>$1$3</substitution>
> >>>> </regex>
> >>>> <regex>
> >>>>   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
> >>>>   <substitution></substitution>
> >>>> </regex>
> >>>> <regex>
> >>>>   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
> >>>>   <substitution>$1$3</substitution>
> >>>> </regex>
> >>>> </regex-normalize>
> >>>>
> >>>> .
> >>>>
> >>>>
> >>> Hi Jon,
> >>>
> >>> I'm not sure the normalize file is the correct place. I use
> >>> regex-urlfilter.txt with the following:
> >>>
> >>> -(session|Session|SESS|sid)
> >>>
> >>> I know it might also exclude a URL like obsession.url, but that is better
> >>> than your fetcher running in circles :-)
> >>>
> >>> Hope it helps,
> >>>
> >>> Gal
> >>>
> >> Yes,
> >>
> >>    Better than circles, but I'm looking to refine the config to allow
> >> for these URLs, not just avoid them.
> >>
> >> -j
> >>
> >>
> >
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
> > .
> >
> >
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
