Thanks for all the helpful comments. Seems like you need to be familiar with the content of a domain before you start crawling its URLs that contain query strings.
On 8/16/05, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> My experience with this topic is the following:
>
> I am trying to use Nutch against our intranet pages. Some pages are
> stored directly on an Apache server. Here the exclusion of the question
> mark makes sense, because static pages contain a lot of links to dynamic
> reports (made in Perl, pulling data from databases). Crawling pages
> with question marks was quite risky because I could easily keep the report
> servers fully occupied, which could lead to problems (on the other hand,
> this was a good stress test... ;-). So for some portion of our intranet
> it is OK to exclude the question mark. (Not to mention that indexing a
> report which changes every moment is not good practice if indexing
> takes two days...)
>
> But we also have a server which is used as a document repository, and
> here all links contain question marks. So I allowed the question mark,
> and the result is that my index contains duplicates (in terms of
> duplicate content). Most of these links contain several parameters, and
> one of them, called orderBy, defines the order of documents on the
> page (by date, by name, by size...). So as a result I ended up with several
> links to pages with the same content (only the order of items differs)
> but with different URLs.
>
> Maybe if I filter the orderBy attribute in regex-normalize.xml then this
> problem would be solved... I don't know.
>
> Regards,
> Lukas
>
> On 8/16/05, Nick Temple <[EMAIL PROTECTED]> wrote:
> > True, fetching and indexing in Nutch aren't quite the same thing, are they?
> > Maybe I should have said "nearly duplicate content" and "long crawls" ...
> > the nearly duplicate stuff comes from the exact same content being sorted
> > multiple different ways, as in the case of a standard Apache index page.
> >
> > Your point about the sessionids, timestamps and others is spot on,
> > too - I had forgotten about them.
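[Editor's note: Lukas's idea of filtering the orderBy parameter could be sketched as rules in Nutch's conf/regex-normalize.xml (read by the regex URL normalizer). This is a hedged sketch, not a tested drop-in: the orderBy name comes from his example, and the patterns may need tuning for other parameter layouts.]

```xml
<!-- Hypothetical additions to conf/regex-normalize.xml: drop the
     orderBy parameter so that sort-order variants of the same page
     normalize to one canonical URL. -->
<regex>
  <!-- remove "orderBy=..." together with one adjoining separator -->
  <pattern>([?&amp;])orderBy=[^&amp;]*&amp;?</pattern>
  <substitution>$1</substitution>
</regex>
<regex>
  <!-- clean up a trailing "?" or "&" left behind by the rule above -->
  <pattern>[?&amp;]$</pattern>
  <substitution></substitution>
</regex>
```

With rules like these, http://host/page?orderBy=date and http://host/page?orderBy=name would both normalize to http://host/page, so the fetcher sees them as one URL.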
> >
> > Maybe it would make sense for people to begin putting together a list of
> > regexes that exclude the most common problematic query strings?
> >
> > Nick
> >
> > -----Original Message-----
> > From: EM [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 3:53 PM
> > To: [email protected]
> > Subject: RE: Fetching pages with query strings
> >
> > Actually, duplicates are excluded at the end, so they aren't kept at all,
> > but one will end up *fetching* a lot of duplicated content constantly.
> >
> > -----Original Message-----
> > From: Nick Temple [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 3:52 PM
> > To: [email protected]
> > Subject: RE: Fetching pages with query strings
> >
> > Hi Bryan --
> >
> > When using a previous system, I did begin to exclude sites with query
> > strings. Some common software uses query strings to create different views
> > of the same data ... I ended up crawling one bulletin board for days - I
> > believe at this point that there was something recursive in nature about
> > some sites. The fact that a site does not use a query string does not, by
> > itself, mean that the site isn't recursive.
> >
> > I'm not sure if Nutch itself has an internal limitation on the
> > representation of URLs; if so, then that would be the only reason I can
> > think of to exclude query strings entirely -- but watch your crawls, and
> > begin to ban recursive systems, or you may end up with a lot of duplicate
> > content.
> >
> > Nick
> >
> > -----Original Message-----
> > From: Bryan Woliner [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 16, 2005 2:25 PM
> > To: [email protected]
> > Subject: Fetching pages with query strings
> >
> > By default the regex-urlfilter.txt file excludes URLs that contain query
> > strings (i.e. include "?"). Could somebody explain the reason for
> > excluding these sites? Is there something risky about including them in a
> > crawl? Is there anyone who is not excluding these files, and if so, how
> > has it worked out? The reason I ask is that some of the domains I'm
> > hoping to crawl use query strings for most of their pages.
> >
> > Thanks,
> > Bryan
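[Editor's note: the normalization discussed in the thread — collapsing URLs that differ only in a sort parameter — can be illustrated outside of Nutch. The sketch below is plain Python (not Nutch code); the host and parameter names are invented, with orderBy taken from Lukas's example.]

```python
# Minimal sketch of query-parameter normalization: strip a sort
# parameter such as orderBy so that URLs which differ only in sort
# order collapse to one canonical form before fetching/indexing.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, name="orderBy"):
    """Remove a single query parameter and rebuild the URL."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Three sort-order variants of the same document listing:
urls = [
    "http://repo.example.com/docs?dir=reports&orderBy=date",
    "http://repo.example.com/docs?dir=reports&orderBy=name",
    "http://repo.example.com/docs?dir=reports&orderBy=size",
]
canonical = {strip_param(u) for u in urls}
print(canonical)  # all three collapse to a single canonical URL
```

A crawler that applies such a normalization before adding URLs to its fetch list would fetch the listing once instead of once per sort order.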
