A bit more info and maybe another concern.
Here's an example of a URL that got crawled:
http://search.aol.ca/redir?urn=http://www.tachyonlabs.com/games.html&url=http://www.tachyonlabs.com/games.html&requestId=fc3678a2dd20b2da&clickedItemRank=1&source=aoldirectory&searchType=MS&query=Games
That looks harmless on the surface, but as I mentioned, it seems to be
coming from a dynamic search, and there are a whole lot of them. Should
we (or could we) be doing something to stop this?
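One way to keep a crawler out of search result pages is to reject those
URLs in the URL filter before they are fetched. Below is a minimal sketch
in Python of a first-match-wins rule list with "+" (accept) and "-"
(reject) rules. The patterns and the names RULES and url_passes are
assumptions made up for illustration; they are not taken from our
configuration. As far as I know, Nutch's regex URL filter file uses a
similar +/- convention, so comparable patterns could go there.

```python
import re

# Hypothetical rules, checked in order; the first pattern that matches decides.
RULES = [
    ("-", re.compile(r"^https?://search\.aol\.")),  # AOL search front ends
    ("-", re.compile(r"/cgi-bin/")),                # CGI search scripts
    ("-", re.compile(r"[?*!@=]")),                  # characters typical of query URLs
    ("+", re.compile(r".")),                        # accept everything else
]


def url_passes(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # No rule matched at all: reject by default.
    return False
```

With these rules, the search.aol.ca redirect URL above is rejected, while
a plain page such as http://www.tachyonlabs.com/games.html is accepted.
The trade-off is that the query-character rule also drops legitimate
dynamic pages, so it may need to be narrowed per site.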
Secondly, that URL is actually a redirect, and the crawler is fetching
and indexing the page it redirects to. That would be fine, except that
we have some regular expressions in the filter that are meant to keep
the redirected site out of the index. Since the original URL passes the
filter (and the redirect target would not), we end up with sites getting
past the regex in the filter. Any general thoughts on how we might start
to tackle this?
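One possible approach is to run the filter against every URL in the
redirect chain, not only the one that was originally queued. The Python
sketch below illustrates the idea only. The EXCLUDE pattern, the
blocked.example domain, and the function names are hypothetical, and how
the redirect chain is obtained from the crawler is left out.

```python
import re

# Hypothetical exclusion pattern, standing in for the regexes already in the filter.
EXCLUDE = [re.compile(r"^https?://([^/]+\.)?blocked\.example/")]


def passes_filter(url):
    """Return True if no exclusion pattern matches the URL."""
    return not any(p.search(url) for p in EXCLUDE)


def should_index(redirect_chain):
    """redirect_chain: the original URL followed by each redirect target, in order.

    The page is indexed only if every URL in the chain passes the filter,
    so a permitted URL cannot smuggle in an excluded destination.
    """
    return all(passes_filter(url) for url in redirect_chain)
```

For example, a chain whose first URL passes but whose final target is on
blocked.example would be rejected, while a chain made up entirely of
permitted URLs would still be indexed.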
Thanks.
Insurance Squared Inc. wrote:
We're running a crawl using Nutch, and the last crawl seemed to be
taking a long time. Looking at the output, it seems to have gone into
AOL's search and is actually crawling search results (it's also
crawling some cgi-bin search results pages on another site). This sure
seems like it could go on forever.
Admittedly we haven't looked at this very deeply yet (I'm not sure why
it has so many AOL search pages to crawl), but if the crawler behaves
this way, it strikes me as likely to be a common occurrence. Is
there something we should be doing to prevent this situation?
Thanks.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general