Well, you should and shouldn't crawl them; it all depends how much free time you have to dedicate to the subject.
Some of my best quality sites have '?' in them, but some other sites that I crawl use it and cause tons of problems. Some of the troublemakers generate session IDs and attach them to the parameter part; others attach the current time as a parameter. There is all kinds of weird stuff people feel obliged to pass through the URL.

If you choose to crawl the dynamically generated pages, be prepared to spend time LOOKING at the fetcher output for anything misbehaving and to fix things by hand, site by site. If you have a handful of sites, you should be fine; if you do anything large scale, it's better to ignore the '?' until the developers implement more configurable per-site limitations.

EM

-----Original Message-----
From: Bryan Woliner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 2:25 PM
To: [email protected]
Subject: Fetching pages with query strings

By default the regex-urlfilter.txt file excludes URLs that contain query strings (i.e. that include "?"). Could somebody explain the reason for excluding these URLs? Is there something risky about including them in a crawl? Is there anyone who is not excluding these URLs, and if so, how has it worked out?

The reason I ask is that some of the domains I'm hoping to crawl use query strings for most of their pages.

Thanks,
Bryan
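P.S. For anyone trying the per-site, fix-by-hand route: the stock regex-urlfilter.txt skips probable query URLs with a rule like `-[?*!@=]` (first matching rule wins). A minimal sketch of relaxing it while still blocking session-ID and timestamp parameters might look like the following; the parameter names (sid, jsessionid, ts) are only illustrative examples, not a complete list:

```
# Allow '?' by dropping it from the default skip rule -[?*!@=]
-[*!@]

# Still skip URLs carrying session ids or timestamps as parameters
# (hypothetical parameter names -- adjust per site)
-.*[?&](sid|jsessionid|ts)=

# Accept everything else
+.
```

You still have to watch the fetcher logs per site, since each troublemaker uses different parameter names.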
