True, fetching and indexing in Nutch aren't quite the same thing, are they? Maybe I should have said "nearly duplicate content" and "long crawls" ... the nearly duplicate stuff comes from the exact same content being sorted multiple different ways, as in the case of a standard Apache index page.
Your point about the session IDs, timestamps and others is spot on, too - I had forgotten about them. Maybe it would make sense for people to begin putting together a list of regexes that excludes the most common problematic query strings?

Nick

-----Original Message-----
From: EM [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 3:53 PM
To: [email protected]
Subject: RE: Fetching pages with query strings

Actually, duplicates are excluded at the end, so they aren't kept at all, but one will end up *fetching* a lot of duplicated content constantly.

-----Original Message-----
From: Nick Temple [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 3:52 PM
To: [email protected]
Subject: RE: Fetching pages with query strings

Hi Bryan --

When using a previous system, I did begin to exclude sites with query strings. Some common software uses query strings to create different views of the same data ... I ended up crawling one bulletin board for days - I believe at this point that there was something recursive in nature about some sites. The fact that a site does not use a query string does not, by itself, mean that the site isn't recursive.

I'm not sure if Nutch itself has an internal limitation on the representation of URLs; if so, that would be the only reason I can think of to exclude query strings entirely -- but watch your crawls, and begin to ban recursive systems, or you may end up with a lot of duplicate content.

Nick

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings

By default the regex-urlfilter.txt file excludes URLs that contain query strings (i.e., those that include "?"). Could somebody explain the reason for excluding these sites? Is there something risky about including them in a crawl? Is there anyone who is not excluding these URLs, and if so, how has it worked out?
The reason I ask is that some of the domains I'm hoping to crawl use query strings for most of their pages.

Thanks,
Bryan
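A shared list of regexes like the one suggested at the top of the thread might start out as a regex-urlfilter.txt fragment along these lines. This is only a sketch, not a tested config: the parameter names are common examples rather than an exhaustive list, and in the regex-urlfilter plugin the first matching rule wins, so these exclusions have to appear before the final catch-all accept rule.

```
# Sketch of a shared exclusion list: instead of the default rule that
# drops every URL containing "?", drop only query strings known to
# produce duplicate or effectively infinite crawls.

# Session-ID parameters (example names; extend as you find more)
-[?&](PHPSESSID|phpsessid|sessionid|sid)=

# Servlet-style session IDs embedded in the path
-;jsessionid=

# Apache auto-index sort links: the same listing sorted several ways,
# e.g. /?C=N;O=D
-\?[CO]=[NDMSA]

# Timestamp/calendar parameters that can generate endless "next" pages
-[?&](date|month|year|timestamp)=

# Accept everything else, including other query strings
+.
```

Anyone trying this should still watch their crawls, as Nick says - this whitelists query strings in general, so any recursive site not caught by one of the patterns above will need its own rule added.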
