[Nutch-general] Re: Crawling search engines and cgi scripts

Insurance Squared Inc. Fri, 16 Dec 2005 13:35:05 -0800

A bit more info and maybe another concern.

Here's an example of a url that got crawled:
http://search.aol.ca/redir?urn=http://www.tachyonlabs.com/games.html&url=http://www.tachyonlabs.com/games.html&requestId=fc3678a2dd20b2da&clickedItemRank=1&source=aoldirectory&searchType=MS&query=Games

Not bad on the surface, however as I mentioned, this seems to be coming from a dynamic search, and there's a whole lot of them. Should we/could we be doing something to stop this?
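
To make the question concrete, here is a rough sketch of the kind of rule we have in mind. It's written in Python rather than as an actual Nutch filter file, and the rule list, the patterns and the accept() helper are all made up for illustration. It assumes a first-match-wins list of accept/reject regexes where a URL that matches nothing is rejected, which is how we understand the regex URL filter to behave.

```python
import re

# Hypothetical rule list: (accept?, pattern). The first rule whose pattern
# is found anywhere in the URL decides; a URL matching no rule is rejected.
RULES = [
    (False, re.compile(r"[?*!@=]")),    # probable query string / dynamic page
    (False, re.compile(r"/cgi-bin/")),  # cgi script output
    (True, re.compile(r".")),           # accept everything else
]


def accept(url):
    """Return True if the URL should be fetched under RULES."""
    for allowed, pattern in RULES:
        if pattern.search(url):
            return allowed
    return False


if __name__ == "__main__":
    print(accept("http://search.aol.ca/redir?urn=x&query=Games"))
    print(accept("http://www.tachyonlabs.com/games.html"))
```

Under these made-up rules the AOL redir url is rejected because it contains "?" and "=", while the plain games.html url is accepted. The obvious cost is that legitimate pages with query strings get dropped as well.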

Secondly, that page is actually a redirect. It's crawling and indexing the redirected page. That'd be fine, except we've got some regular expressions in the filter that would prevent this redirected site from being indexed. However, since the original url does pass (and the redirected one doesn't), we end up with sites that are getting past the regex in the filter. Any general thoughts on how we might start to tackle this?
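
In other words, what we'd like is for the filter to be checked against both the url that was requested and the url it redirects to. A minimal sketch of that idea follows, again in Python and not tied to any real Nutch API. BLOCKED, passes_filter() and should_index() are hypothetical names, and blocked.example.com just stands in for one of our real filter regexes.

```python
import re

# Hypothetical stand-in for the regexes in our filter.
BLOCKED = [re.compile(r"blocked\.example\.com")]


def passes_filter(url):
    """True if no blocking pattern is found in the URL."""
    return not any(p.search(url) for p in BLOCKED)


def should_index(original_url, final_url):
    """Index only if both the requested URL and the redirect target pass."""
    return passes_filter(original_url) and passes_filter(final_url)


if __name__ == "__main__":
    print(should_index("http://search.example.org/redir",
                       "http://blocked.example.com/page"))
```

Here a redirect that lands on a blocked site is refused even though the url that led to it passes on its own.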


Thanks.


Insurance Squared Inc. wrote:

We're running a crawl using nutch and the last crawl seemed to be taking a long time. Looking at the output, it seems it's gone into AOL's search and is actually crawling search results (it's also crawling some cgi-bin search results page on another site). This sure seems like it could go on forever.
Admittedly we haven't looked at this very deeply yet (I'm not sure why it's got so many search pages on AOL to crawl), but it strikes me that this is likely a common occurrence if it's acting that way. Is there something we should be doing to prevent this situation?
Thanks.



