Thanks for the quick feedback, Stefan! :) Just to clarify the one example you provided:

Also, your plugin needs to be a little bit smarter than just looking for different values, since foo.com/cms.do?page=1 and foo.com/cms.do?page=2 are two different pages.

The filter would flag this only if _every_ link on the page that referred to foo.com contained "page=2". All it takes is a single link to a different page and the filter will let this page pass.
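(Put in terms of the algorithm below: "page" would only match the pattern if its value were identical across every same-site link on the page and different from the fetched URL's value, and ordinary pagination links break that condition immediately.)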

The only case where I can really see this happening is if there is no navigation back at all (to the homepage, etc.). In that case, it's likely a recursive trap anyway.

--Matt

On Mar 13, 2006, at 8:43 PM, Stefan Groschupf wrote:

Hi Matt,

Sounds very interesting. Having an extension point during the update would be great, since it would also allow implementing plugins that deal with identical pages carrying different metadata, which we cannot do today. So instead of designing the API so that we pass in an old and a new MapWritable, we could process a CrawlDatum from the CrawlDb and a new one from the fetch process.
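Something like the following interface is what I imagine for such an extension point (all names here are invented, just to sketch the idea):

 // Hypothetical extension point, invoked during the CrawlDb update for
 // each URL that has both an existing entry and a freshly fetched one.
 public interface CrawlDatumMergeFilter {
   /**
    * @param url   the (normalized) URL both datums belong to
    * @param old   the CrawlDatum already stored in the CrawlDb
    * @param fresh the CrawlDatum produced by the current fetch
    * @return the datum to store, or null to drop the URL entirely
    */
   CrawlDatum filter(String url, CrawlDatum old, CrawlDatum fresh);
 }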

I see the following problems:
The URL is currently used as the key, so you cannot match up an old and a new CrawlDatum by URL when there are dynamic parameters. Also, your plugin needs to be a little bit smarter than just looking for different values, since foo.com/cms.do?page=1 and foo.com/cms.do?page=2 are two different pages.

What I can imagine is using normalized URLs to find identical pages: store the parameters as metadata in the CrawlDatum, and have a plugin that can process the CrawlDatum from the CrawlDb and the one from the segment during the database-update reduce.

Just my 2 cents.
Greetings,
Stefan

On Mar 14, 2006, at 2:33 AM, Matt Kangas wrote:

Hi nutch-dev,

I know that we have RegexUrlNormalizer already for removing session-ids from URLs, but lately I've been wondering if there isn't a more general way to solve this, without relying on pre-built patterns.

I think I have an answer that will work. I haven't seen this approach published anywhere, so any failings are entirely my fault. ;) What I'm wondering is:
- Does this seem like a good (effective, efficient) algorithm for catching session-id URLs?
- If so, where is the best place to implement it within Nutch?

Basic idea: session ids within URLs only cause problems for crawlers when they change. This typically occurs when a server-side session expires and a new id is issued. So, rather than looking for URL argument patterns (as RegexUrlNormalizer does), look for a value-transition pattern.
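For example (made-up URLs): suppose the crawler fetches

 http://foo.bar/store/list?cat=5&sid=AAA111

using a link collected in an earlier round. The server has since expired session AAA111 and issued a new one, so every same-site link on the returned page now carries sid=BBB222, while cat=5 is unchanged. The parameter that changed in lockstep across the whole page is the session id; the one that stayed put is a real argument.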

Algorithm:

1) Iterate over each page in a fetched segment

2) For each successful fetch, extract:
 - The fetched URL. Call this (u0)
 - All links on the page that refer to the same site/domain. Call this set (u1..N); a quick way to collect these is sketched below.
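For instance, collecting (u1..N) could be as simple as keeping the outlinks whose host matches u0. The snippet below is only my illustration using java.net.URL, with "outlinks" standing in for whatever array of link strings the parser produced:

 // Keep only the outlinks that point at the same host as the fetched URL.
 // (MalformedURLException handling omitted.)
 String host = new URL(u0).getHost().toLowerCase();
 List sameSite = new ArrayList();
 for (int i = 0; i < outlinks.length; i++) {
   URL link = new URL(outlinks[i]);
   if (host.equals(link.getHost().toLowerCase())) {
     sameSite.add(outlinks[i]);
   }
 }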

3) Parse u0 into parameters (p0) as follows:
 - named parameters: add (key,value) to Map
 - positional (path) params: add (position,value) to Map

So for the URL "http://foo.bar/spam/eggs?x=true&y=2", pseudocode would look like:
 p0 = new HashMap();
 // positional path segments, keyed by their position
 p0.put(new Integer(1), "spam");
 p0.put(new Integer(2), "eggs");
 // named query parameters, keyed by name
 p0.put("x", "true");
 p0.put("y", "2");
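A rough sketch of that parse step (the method is my own illustration; java.net.URL does the splitting of path and query):

 // Split a URL into positional path segments and named query parameters,
 // collected into a single map as in the example above.
 Map parseParams(URL u) {
   Map params = new HashMap();
   String[] segments = u.getPath().split("/");
   int pos = 1;
   for (int i = 0; i < segments.length; i++) {
     if (segments[i].length() == 0) continue;       // skip empty segments
     params.put(new Integer(pos++), segments[i]);   // positional params
   }
   String query = u.getQuery();
   if (query != null) {
     String[] pairs = query.split("&");
     for (int i = 0; i < pairs.length; i++) {
       int eq = pairs[i].indexOf('=');
       if (eq > 0) {                                 // named params
         params.put(pairs[i].substring(0, eq), pairs[i].substring(eq + 1));
       }
     }
   }
   return params;
 }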

4) Parse u1..N into (p1..N) using the same method

5) Compare p0 with p1..N. Look for the following pattern:
 - keys that are present for all p0..N, and
 - values that are identical for all p1..N, and
 - the value in p0 is _different_

If you see this condition, flag the page as "contains session id that just changed" and deal with it accordingly. (Delete from crawldb, etc)
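A sketch of that test, given the maps built in steps 3 and 4 (the method name and the List-of-Maps argument are my own illustration):

 // True if some key appears in p0 and in every outlink map, with one
 // constant value across all outlinks that differs from the value in p0,
 // i.e. the fetched URL carries a session id that has just changed.
 boolean looksLikeExpiredSessionId(Map p0, List outlinkParams) {
   if (outlinkParams.isEmpty()) return false;
   for (Iterator keys = p0.keySet().iterator(); keys.hasNext();) {
     Object key = keys.next();
     Object linkValue = null;
     boolean sameEverywhere = true;
     for (Iterator links = outlinkParams.iterator(); links.hasNext();) {
       Object v = ((Map) links.next()).get(key);
       if (v == null || (linkValue != null && !linkValue.equals(v))) {
         sameEverywhere = false;   // key missing or values disagree
         break;
       }
       linkValue = v;
     }
     if (sameEverywhere && !p0.get(key).equals(linkValue)) {
       return true;   // every outlink agrees on a value that differs from u0's
     }
   }
   return false;
 }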

So... for anyone who's still reading ;), does this seem like it would work for catching session-ids? What corner-cases would trip it up? Can you think of cases when it would fall flat? And if it still seems worthwhile, where's the best place within Nutch to put it? (Perhaps a new ExtensionPoint that is used by "nutch updatedb"?)

--Matt

--
Matt Kangas / [EMAIL PROTECTED]




---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



--
Matt Kangas / [EMAIL PROTECTED]



