regex-normalize - Re: SessionIDs and forums are killing my fetch

Jon Shoberg Wed, 28 Sep 2005 04:48:30 -0700

I thought this could be done via regex-normalize? It is my preference to use functionality/features of the configuration rather than maintaining a local patch.

-j


Jack Tang wrote:

Hi Jon

Please see the details in the getOutlinks() method of the DOMContentUtils
class in the parse-html plugin.

You can revise the URLs before

outlinks.add(new Outlink(url.toString(), linkText
                                    .toString().trim()));

Hope it helps

Regards
/Jack
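
A minimal sketch of the kind of URL revision Jack describes, assuming it would run on the URL string just before the outlinks.add(...) call. The class OutlinkUrlCleaner and the method dropSid are hypothetical names used for illustration; they are not part of Nutch or the parse-html plugin, and the sketch ignores URL fragments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: removes the "sid" query
// parameter so that URLs differing only by session ID become identical.
final class OutlinkUrlCleaner {

    static String dropSid(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to strip
        }
        String base = url.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String param : url.substring(q + 1).split("&")) {
            // keep every non-empty parameter except the session ID
            if (!param.isEmpty() && !param.startsWith("sid=")) {
                kept.add(param);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        System.out.println(dropSid(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"));
        System.out.println(dropSid(
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
            + "&sid=aa4d00c2717784c0c8f5a8182ce772ea"));
    }
}
```

In the plugin, the cleaned string would then be passed to new Outlink(...) in place of url.toString(). This is the local-patch route, so it has to be maintained across upgrades.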

On 9/28/05, Gal Nitzan <[EMAIL PROTECTED]> wrote:

Hi Jack,

How can you discard a URL from the fetchlist?

Regards,
Gal

Jack Tang wrote:

Hi Jon

I think you can revise the URL by discarding the "sid" param before
putting it into the fetchlist.

Regards
/Jack

On 9/28/05, Jon Shoberg <[EMAIL PROTECTED]> wrote:

Gal Nitzan wrote:

Jon Shoberg wrote:

I'm getting a ton of duplicate content from a forum with sessionIDs.
It's a phpBB forum, which uses a question mark in the URL followed by a
sid parameter.

What have other people done to crawl forums and minimize duplicates?
These are ones that dedup is not catching.

Is anyone able to explain how regex-normalize.xml is used? I'm about to
open the source and see...

The URLs look like the following and appear to have the same content to the user:

http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea


Below is my regex normalize file:

<?xml version="1.0"?>
<!-- This is the configuration file for the RegexUrlNormalize Class.
    This is intended so that users can specify substitutions to be
    done on URLs. The regex engine that is used is Perl5 compatible.
    The rules are applied to URLs in the order they occur in this
    file.  -->

<!-- WATCH OUT: an xml parser reads this file, and ampersands must be
    expanded to &amp; -->

<!-- The following rules show how to strip out session IDs
    that are 32 characters long and have the parameter
    name of PHPSESSID. Order does matter!  -->
<regex-normalize>
<regex>
 <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
 <substitution></substitution>
</regex>
<regex>
 <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
 <substitution>$1$3</substitution>
</regex>
<regex>
 <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
 <substitution></substitution>
</regex>
<regex>
 <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
 <substitution>$1$3</substitution>
</regex>
</regex-normalize>
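
To check what these four rules are meant to do, here is a minimal sketch that applies them in file order to the sample URLs above. It assumes java.util.regex as a stand-in for the Perl5-compatible engine the file mentions, and it assumes each rule replaces every match. The patterns are written as they read after the XML parser has decoded the entities. The class name SessionIdNormalizer and the viewtopic.php URL are made up for the example.

```java
import java.util.regex.Pattern;

// Sketch only: mimics the rule list from the config above, it is not the
// Nutch normalizer itself.
final class SessionIdNormalizer {

    // {pattern, substitution} pairs in the same order as the config file
    private static final String[][] RULES = {
        {"(\\?|\\&|\\&amp;)PHPSESSID=[a-zA-Z0-9]{32}$", ""},
        {"(\\?|\\&|\\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)", "$1$3"},
        {"(\\?|\\&|\\&amp;)sid=[a-zA-Z0-9]{32}$", ""},
        {"(\\?|\\&|\\&amp;)sid=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)", "$1$3"},
    };

    static String normalize(String url) {
        String result = url;
        for (String[] rule : RULES) {
            // each rule sees the output of the previous one
            result = Pattern.compile(rule[0]).matcher(result).replaceAll(rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
            "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611",
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
                + "&sid=aa4d00c2717784c0c8f5a8182ce772ea",
            // hypothetical URL with the session ID in the middle of the query
            "http://domain/forum/viewtopic.php?sid=0c5dd3b77aac9d081b2108cfb3dae592&t=5",
        };
        for (String url : urls) {
            System.out.println(normalize(url));
        }
    }
}
```

Under these assumptions the two faq.php URLs collapse to the same string, so the rules themselves look right for 32-character sid values. If duplicates still get through, it may be worth checking whether the normalizer is actually being applied to these URLs.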



Hi Jon,

I'm not sure if the normalize file is the correct place. I use
regex-urlfilter.txt with the following:

-(session|Session|SESS|sid)

I know it might leave out a URL like obsession.url, but it is better
than your fetcher running in circles :-)

Hope it helps,

Gal
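
A small sketch of the trade-off Gal mentions, assuming the filter rejects any URL in which the pattern is found anywhere (substring matching), which is consistent with his obsession.url remark. The class SessionUrlFilter and the sample URLs other than faq.php are made up for the example; this is not the Nutch filter class.

```java
import java.util.regex.Pattern;

// Sketch only: models a single "-" (exclude) rule with substring matching.
final class SessionUrlFilter {

    private static final Pattern EXCLUDE =
        Pattern.compile("(session|Session|SESS|sid)");

    // true if the URL would be kept, false if the exclude rule hits
    static boolean accept(String url) {
        return !EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
            "http://domain/forum/index.php",
            "http://domain/obsession.url",      // contains "session"
            "http://domain/inside/page.html",   // contains "sid"
        };
        for (String url : urls) {
            System.out.println((accept(url) ? "+ " : "- ") + url);
        }
    }
}
```

The rule drops every URL carrying a session ID rather than rewriting it, and it also drops unrelated URLs that happen to contain one of the substrings, such as "inside" or "obsession".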


Yes,

  Better than circles, but I'm looking to refine the config to handle
these URLs, not just avoid them.

-j



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

