Hi Bryan -- When using a previous system, I did begin to exclude sites with query strings. Some common software uses query strings to create different views of the same data ... I ended up crawling one bulletin board for days -- I believe at this point that there was something recursive about some of those sites. That said, the fact that a site does not use query strings does not, by itself, mean that the site isn't recursive.
I'm not sure whether Nutch itself has an internal limitation on the representation of URLs; if it does, that would be the only reason I can think of to exclude query strings entirely. But watch your crawls and begin to ban recursive systems, or you may end up with a lot of duplicate content.

Nick

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings

By default, the regex-urlfilter.txt file excludes URLs that contain query strings (i.e. those that include "?"). Could somebody explain the reason for excluding these URLs? Is there something risky about including them in a crawl? Is there anyone who is not excluding these files, and if so, how has it worked out?

The reason I ask is that some of the domains I'm hoping to crawl use query strings for most of their pages.

Thanks,
Bryan
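
For anyone following the thread: the rule in question in the stock regex-urlfilter.txt looks roughly like the following (exact contents vary by Nutch version, so treat this as a sketch, not the definitive file):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

Removing "?" and "=" from that character class (for example, changing the line to -[*!@]) would let query-string URLs through the filter. Per Nick's warning above, it is then worth adding site-specific "-" exclusion rules for any boards or calendars that turn out to generate recursive, effectively infinite URL spaces.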
