A bit more info and maybe another concern.

Here's an example of a URL that got crawled:
http://search.aol.ca/redir?urn=http://www.tachyonlabs.com/games.html&url=http://www.tachyonlabs.com/games.html&requestId=fc3678a2dd20b2da&clickedItemRank=1&source=aoldirectory&searchType=MS&query=Games

Not bad on the surface. However, as I mentioned, this seems to be coming from a dynamic search, and there are a whole lot of URLs like it. Should we (or could we) be doing something to stop this?
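One way to think about keeping dynamic search URLs out of a crawl is a rule list where the first matching pattern decides whether a URL is kept. The sketch below is only an illustration in Python, not Nutch's own code: the rule list, the `accepts` helper, and the shortened test URL are all assumptions made for the example. It follows the "+/-, first match wins, no match means reject" convention that Nutch's regex URL filter files use, as far as I understand them.

```python
import re

# Hypothetical rule list in the style of a regex URL filter file:
# '+' accepts, '-' rejects, and the first pattern that matches decides.
RULES = [
    ("-", re.compile(r"[?*!@=]")),    # reject URLs that look like dynamic queries
    ("-", re.compile(r"/cgi-bin/")),  # reject cgi-bin pages
    ("+", re.compile(r".")),          # accept everything else
]


def accepts(url: str) -> bool:
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched at all: reject


if __name__ == "__main__":
    # Shortened, illustrative version of the AOL search URL.
    print(accepts("http://search.aol.ca/redir?query=Games"))
    print(accepts("http://www.tachyonlabs.com/games.html"))
```

A rule this broad also drops legitimate pages that happen to use query strings, so it is a trade-off rather than a free fix.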

Secondly, that page is actually a redirect, and Nutch is crawling and indexing the page it redirects to. That'd be fine, except we've got some regular expressions in the filter that would prevent the redirect target from being indexed. Since the original URL passes the filter (and the redirect target doesn't), we end up with sites getting past the regex in the filter. Any general thoughts on how we might start to tackle this?
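To make the redirect problem concrete, here is a rough Python sketch of the idea "filter the target as well as the original URL". It is not how Nutch does it. The names `BLOCKED`, `passes_filter`, `embedded_target` and `should_index`, and the `blocked.example` host, are all hypothetical. It only handles the case where the target is visible in a `url=` query parameter, as in the AOL link above.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical exclusion patterns, anchored at the start of the URL
# (an assumption about how such filter regexes are typically written).
BLOCKED = [re.compile(r"^https?://([a-z0-9-]+\.)*blocked\.example/")]


def passes_filter(url: str) -> bool:
    """True if no exclusion pattern matches the URL."""
    return not any(p.match(url) for p in BLOCKED)


def embedded_target(url: str, param: str = "url"):
    """Return the URL carried in the given query parameter, or None."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


def should_index(url: str) -> bool:
    """Require both the URL and any embedded redirect target to pass."""
    if not passes_filter(url):
        return False
    target = embedded_target(url)
    return target is None or passes_filter(target)


if __name__ == "__main__":
    redirect = "http://search.aol.ca/redir?url=http://blocked.example/page&query=Games"
    print(passes_filter(redirect))  # the original URL alone gets through
    print(should_index(redirect))   # checking the target as well catches it
```

A real HTTP redirect only reveals its target at fetch time, so the same check would have to be applied to the redirect location before the page is stored or indexed.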

Thanks.


Insurance Squared Inc. wrote:

We're running a crawl using Nutch, and the last crawl seemed to be taking a long time. Looking at the output, it seems it's gone into AOL's search and is actually crawling search results (it's also crawling some cgi-bin search results pages on another site). This sure seems like it could go on forever.

Admittedly, we haven't looked at this very deeply yet (I'm not sure why it's got so many search pages on AOL to crawl), but if the crawler behaves this way by default, it strikes me as likely to be a common occurrence. Is there something we should be doing to prevent this situation?

Thanks.



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
