Thanks for all the helpful comments. Seems like you need to be familiar with the content of a domain before you start crawling its URLs that contain query strings.
On 8/16/05, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> My experience with this topic is the following:
>
> I am trying to use Nutch against our intranet pages. Some pages are
> stored directly on an Apache server. Here the exclusion of the question
> mark makes sense, because static pages contain a lot of links to dynamic
> reports (made in Perl, pulling data from databases). Crawling pages
> with question marks was quite risky because I could easily keep the report
> servers fully occupied, which could lead to problems (on the other hand,
> this was a good stress test... ;-). So for some portion of our intranet
> it is OK to exclude the question mark. (Not to mention that indexing a
> report which changes every moment is not good practice if indexing
> takes two days...)
>
> But we also have a server which is used as a document repository, and
> here all links contain question marks. So I allowed the question mark,
> and the result is that my index contains duplicates (in terms of
> duplicate content). Most of these links contain several parameters, and
> one of them, called orderBy, defines the order of documents on the
> page (by date, by name, by size...). So as a result I ended up with several
> links to pages with the same content (only the order of items differs)
> but with different URLs.
>
> Maybe if I filter the orderBy attribute in regex-normalize.xml then this
> problem would be solved... I don't know.
>
> Regards,
> Lukas
>
> On 8/16/05, Nick Temple <[EMAIL PROTECTED]> wrote:
> > True, fetching and indexing in Nutch aren't quite the same thing, are they?
> > Maybe I should have said "nearly duplicate content" and "long crawls" ...
> > the nearly duplicate stuff comes from the exact same content being sorted
> > multiple different ways, as in the case of a standard Apache index page.
> >
> > Your point about the sessionids, timestamps and others is spot on,
> > too - I had forgotten about them.
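[Editor's note: Lukas's idea of filtering the orderBy parameter could be sketched as rules in Nutch's conf/regex-normalize.xml (read by the regex URL normalizer). This is a hedged sketch, not a tested drop-in: the orderBy name comes from his example, and the patterns may need tuning for other parameter layouts.]

```xml
<!-- Hypothetical additions to conf/regex-normalize.xml: drop the
     orderBy parameter so that sort-order variants of the same page
     normalize to one canonical URL. -->
<regex>
  <!-- remove "orderBy=..." together with one adjoining separator -->
  <pattern>([?&amp;])orderBy=[^&amp;]*&amp;?</pattern>
  <substitution>$1</substitution>
</regex>
<regex>
  <!-- clean up a trailing "?" or "&" left behind by the rule above -->
  <pattern>[?&amp;]$</pattern>
  <substitution></substitution>
</regex>
```

With rules like these, http://host/page?orderBy=date and http://host/page?orderBy=name would both normalize to http://host/page, so the fetcher sees them as one URL.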
> >
> > Maybe it would make sense for people to begin putting together a list of
> > regexes that exclude the most common problematic query strings?
> >
> > Nick
> >
> > -----Original Message-----
> > From: EM [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 3:53 PM
> > To: [email protected]
> > Subject: RE: Fetching pages with query strings
> >
> > Actually, duplicates are excluded at the end, so they aren't kept at all,
> > but one will end up *fetching* a lot of duplicated content constantly.
> >
> > -----Original Message-----
> > From: Nick Temple [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 3:52 PM
> > To: [email protected]
> > Subject: RE: Fetching pages with query strings
> >
> > Hi Bryan --
> >
> > When using a previous system, I did begin to exclude sites with query
> > strings. Some common software uses query strings to create different views
> > of the same data ... I ended up crawling one bulletin board for days - I
> > believe at this point that there was something recursive in nature about
> > some sites. The fact that a site does not use a query string does not, by
> > itself, mean that the site isn't recursive.
> >
> > I'm not sure if Nutch itself has an internal limitation on the
> > representation of URLs; if so, then that would be the only reason I can
> > think of to exclude query strings entirely -- but watch your crawls, and
> > begin to ban recursive systems, or you may end up with a lot of duplicate
> > content.
> >
> > Nick
> >
> > -----Original Message-----
> > From: Bryan Woliner [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 2:25 PM
> > To: [email protected]
> > Subject: Fetching pages with query strings
> >
> > By default the regex-urlfilter.txt file excludes URLs that contain query
> > strings (i.e. include "?"). Could somebody explain the reason for
> > excluding these sites? Is there something risky about including them in a
> > crawl? Is there anyone who is not excluding these files, and if so, how
> > has it worked out? The reason I ask is that some of the domains I'm
> > hoping to crawl use query strings for most of their pages.
> >
> > Thanks,
> > Bryan
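[Editor's note: the normalization discussed in the thread — collapsing URLs that differ only in a sort parameter — can be illustrated outside of Nutch. The sketch below is plain Python (not Nutch code); the host and parameter names are invented, with orderBy taken from Lukas's example.]

```python
# Minimal sketch of query-parameter normalization: strip a sort
# parameter such as orderBy so that URLs which differ only in sort
# order collapse to one canonical form before fetching/indexing.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, name="orderBy"):
    """Remove a single query parameter and rebuild the URL."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Three sort-order variants of the same document listing:
urls = [
    "http://repo.example.com/docs?dir=reports&orderBy=date",
    "http://repo.example.com/docs?dir=reports&orderBy=name",
    "http://repo.example.com/docs?dir=reports&orderBy=size",
]
canonical = {strip_param(u) for u in urls}
print(canonical)  # all three collapse to a single canonical URL
```

A crawler that applies such a normalization before adding URLs to its fetch list would fetch the listing once instead of once per sort order.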
