True, fetching and indexing in Nutch aren't quite the same thing, are they? Maybe I should have said "nearly duplicate content" and "long crawls" ... the nearly duplicate stuff comes from the exact same content being sorted multiple different ways, as in the case of a standard Apache index page.
Your point about the session IDs, timestamps and others is spot on, too - I had forgotten about them. Maybe it would make sense for people to begin putting together a list of regexes that excludes the most common problematic query strings?

Nick

-----Original Message-----
From: EM [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 3:53 PM
To: [email protected]
Subject: RE: Fetching pages with query strings

Actually, duplicates are excluded at the end, so they aren't kept at all, but one will end up *fetching* a lot of duplicated content constantly.

-----Original Message-----
From: Nick Temple [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 3:52 PM
To: [email protected]
Subject: RE: Fetching pages with query strings

Hi Bryan --

When using a previous system, I did begin to exclude sites with query strings. Some common software uses query strings to create different views of the same data ... I ended up crawling one bulletin board for days - I believe at this point that there was something recursive in nature about some sites. The fact that a site does not use a query string does not, by itself, mean that the site isn't recursive.

I'm not sure if Nutch itself has an internal limitation on the representation of URLs; if so, that would be the only reason I can think of to exclude query strings entirely -- but watch your crawls, and begin to ban recursive systems, or you may end up with a lot of duplicate content.

Nick

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings

By default the regex-urlfilter.txt file excludes URLs that contain query strings (i.e., those that include "?"). Could somebody explain the reason for excluding these sites? Is there something risky about including them in a crawl? Is there anyone who is not excluding these URLs, and if so, how has it worked out?
The reason I ask is that some of the domains I'm hoping to crawl use query strings for most of their pages.

Thanks,
Bryan
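A shared list of regexes like the one suggested at the top of the thread might start out as a regex-urlfilter.txt fragment along these lines. This is only a sketch, not a tested config: the parameter names are common examples rather than an exhaustive list, and in the regex-urlfilter plugin the first matching rule wins, so these exclusions have to appear before the final catch-all accept rule.

```
# Sketch of a shared exclusion list: instead of the default rule that
# drops every URL containing "?", drop only query strings known to
# produce duplicate or effectively infinite crawls.

# Session-ID parameters (example names; extend as you find more)
-[?&](PHPSESSID|phpsessid|sessionid|sid)=

# Servlet-style session IDs embedded in the path
-;jsessionid=

# Apache auto-index sort links: the same listing sorted several ways,
# e.g. /?C=N;O=D
-\?[CO]=[NDMSA]

# Timestamp/calendar parameters that can generate endless "next" pages
-[?&](date|month|year|timestamp)=

# Accept everything else, including other query strings
+.
```

Anyone trying this should still watch their crawls, as Nick says - this whitelists query strings in general, so any recursive site not caught by one of the patterns above will need its own rule added.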
