Hi Bryan --

When using a previous system, I did begin to exclude sites with query
strings. Some common software uses query strings to create different views
of the same data ... I once ended up crawling a single bulletin board for
days - I believe at this point that something about some sites is recursive
in nature. The fact that a site does not use query strings does not, by
itself, mean that the site isn't recursive.

I'm not sure whether Nutch itself has an internal limitation on the
representation of URLs; if so, that would be the only reason I can think of
to exclude query strings entirely. But watch your crawls and begin to ban
recursive systems, or you may end up with a lot of duplicate content.
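For what it's worth, one way to relax the filter while still banning problem sites might look roughly like the sketch below (the exact default rule varies by Nutch version, and the forum URL is just a placeholder for whatever recursive site you discover):

```
# In conf/regex-urlfilter.txt -- the stock rule that skips URLs
# containing characters commonly found in query strings:
# -[?*!@=]
#
# Commenting it out (as above) lets query-string URLs through.
# Then ban specific recursive sites as you find them, e.g. a
# bulletin board that generates endless sorted views of the
# same threads (hypothetical pattern):
-^http://forum\.example\.com/.*sortby=
```

Rules are applied in order, and the first matching pattern wins, so put the more specific bans before any broad accept rules.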

Nick

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings


By default the regex-urlfilter.txt file excludes URLs that contain query
strings (i.e. include "?"). Could somebody explain the reason for excluding
these URLs? Is there something risky about including them in a crawl? Is
there anyone who is not excluding these URLs, and if so, how has it worked
out? The reason I ask is that some of the domains I'm hoping to crawl use
query strings for most of their pages.

Thanks,
Bryan
