Well, you should and shouldn't crawl them; it all depends how much free time
you have to dedicate to the subject.

Some of my best quality sites have '?' in their URLs. But some other sites
that I crawl use it and cause tons of problems. Some of the 'troublemakers'
generate session ids and attach them to the query string. Others
attach the current time as a parameter. There is all kinds of weird
stuff people feel obliged to pass through the URL.
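To see why session ids and timestamps are troublemakers: the same page comes back under a different URL on every fetch, so the crawler sees endless "new" pages. A minimal sketch of one workaround, stripping such parameters before deduplication (the parameter names here are common examples, not an exhaustive or Nutch-specific list):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed noise parameters -- adjust per site after looking at fetcher output.
NOISE_PARAMS = {"sessionid", "jsessionid", "phpsessid", "sid", "timestamp"}

def normalize(url):
    """Drop session-id/timestamp parameters so duplicate pages
    collapse to a single canonical URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in NOISE_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(normalize("http://example.com/page?id=7&jsessionid=ABC123"))
# both fetches of the same page now map to http://example.com/page?id=7
```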

If you choose to crawl dynamically generated pages, be prepared to
spend time LOOKING at the fetcher output for misbehaving URLs and fixing
things by hand, site by site. If you have a handful of sites, you should be
fine; if you're doing anything large scale, it's better to keep ignoring the
'?' until the developers implement more configurable per-site limitations.
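For the handful-of-sites case, the per-site fixing can be done in regex-urlfilter.txt itself. A sketch (example.com is a placeholder host; the exact default exclusion pattern may differ between Nutch versions -- check your conf/regex-urlfilter.txt):

```shell
# Allow query strings only on a site you have inspected and trust:
# +^http://www\.example\.com/.*\?page=
#
# Reject obvious session-id parameters everywhere:
# -(?i)[?&](sessionid|jsessionid|phpsessid)=
#
# Keep the default exclusion for everything else:
# -[?*!@=]
echo "rules go in conf/regex-urlfilter.txt; order matters, first match wins"
```

Note that rules are applied in order and the first matching '+' or '-' rule decides, so the site-specific allow lines must come before the blanket exclusion.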

EM
-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings

By default the regex-urlfilter.txt file excludes URLs that contain query 
strings (i.e. include "?"). Could somebody explain the reason for excluding 
these URLs? Is there something risky about including them in a crawl? Is 
there anyone who is not excluding these URLs, and if so, how has it worked 
out? The reason I ask is that some of the domains I'm hoping to crawl use 
query strings for most of their pages.

Thanks,
Bryan
