Hi,

My experience with this topic is the following:

I am trying to use Nutch against our intranet pages. Some pages are
stored directly on the Apache server. Here, excluding the question
mark makes sense because the static pages contain a lot of links to
dynamic reports (written in Perl, pulling data from databases).
Crawling pages with question marks was quite risky because I could
easily keep the report servers fully occupied, which could lead to
problems (on the other hand, this was a good stress test... ;-). So
for some portion of our intranet it is OK to exclude the question
mark. (Not to mention that indexing a report which changes every
moment is not good practice if indexing takes two days...)
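(For reference, the rule I am talking about is the default skip rule
in conf/regex-urlfilter.txt; if I remember correctly it looks
something like this:

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

Commenting out or narrowing that line is what lets query-string URLs
through.)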

But we also have a server which is used as a document repository, and
there all links contain question marks. So I allowed the question
mark, and the result is that my index contains duplicates (in terms
of duplicate content). Most of these links contain several
parameters, one of which is called orderBy and defines the order of
documents on the page (by date, by name, by size...). As a result I
ended up with several links to the same content (only the order of
the items differs) but with different URLs.

Maybe if I filter out the orderBy parameter in regex-normalize.xml,
this problem would be solved... I don't know.
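Something like the following in conf/regex-normalize.xml might do it.
This is just an untested sketch: the parameter name orderBy is from
our URLs, and the two rules try to cover orderBy appearing in the
middle of the query string and at the end (or alone):

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- drop orderBy=... when it is followed by another parameter -->
  <regex>
    <pattern>([?&amp;])orderBy=[^&amp;]*&amp;</pattern>
    <substitution>$1</substitution>
  </regex>
  <!-- drop orderBy=... when it is the last (or only) parameter -->
  <regex>
    <pattern>[?&amp;]orderBy=[^&amp;]*$</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
```

With rules like these, URLs such as ...?orderBy=date&dir=asc and
...?dir=asc&orderBy=name should both normalize to ...?dir=asc, so the
differently-sorted views would collapse to one URL before fetching.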

Regards,
Lukas

On 8/16/05, Nick Temple <[EMAIL PROTECTED]> wrote:
> True, fetching and indexing in Nutch isn't quite the same thing, is it?
> Maybe I should have said "nearly duplicate content" and "long crawls" .. the
> nearly duplicate stuff comes from the exact same content being sorted
> multiple different ways, as in the case of a standard Apache index page.
> 
> Your point about the sessionid's, timestamps and others are right on spot,
> too - I had forgotten about them.
> 
> Maybe it would make sense for people to begin putting together a list of
> regexes that excludes the most common problematic query strings?
> 
> Nick
> 
> -----Original Message-----
> From: EM [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 16, 2005 3:53 PM
> To: [email protected]
> Subject: RE: Fetching pages with query strings
> 
> 
> Actually, duplicates are excluded on the end, so they aren't kept at all,
> but one will end up *fetching* a lot of duplicated content constantly.
> 
> -----Original Message-----
> From: Nick Temple [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 16, 2005 3:52 PM
> To: [email protected]
> Subject: RE: Fetching pages with query strings
> 
> Hi Bryan --
> 
> When using a previous system, I did begin to exclude sites with query
> strings. Some common software uses query strings to create different views
> of the same data ... I ended up crawling one bulletin board for days - I
> believe at this point that there was something recursive in nature about
> some sites.  The fact that a site does not use a query string does not, by
> itself, mean that the site isn't recursive.
> 
> I'm not sure if Nutch itself has an internal limitation on the
> representation of URL's, if so, then that would be the only reason I can
> think of to exclude query strings entirely -- but watch your crawls, and
> begin to ban recursive systems, or you may end up with a lot of duplicate
> content.
> 
> Nick
> 
> -----Original Message-----
> From: Bryan Woliner [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 16, 2005 2:25 PM
> To: [email protected]
> Subject: Fetching pages with query strings
> 
> 
> By default the regex-urlfilter.txt file excludes URLs that contain query
> strings (i.e. include "?"). Could somebody explain the reason for excluding
> these URLs? Is there something risky about including them in a crawl? Is
> there anyone who is not excluding these URLs, and if so, how has it worked
> out? The reason I ask is that some of the domains I'm hoping to crawl use
> query strings for most of their pages.
> 
> Thanks,
> Bryan
