I've been experiencing an intermittent crash where no Python stack trace is produced. It happens in a URL-downloading process that can run for up to 12 hours and crawls about 50,000 URLs.
I'm using urllib2 for the downloads. There are 5-10 downloading threads, plus some custom website-exploration code that provides the URLs to crawl. Downloads are completed in memory (not piped to disk as they arrive) and then saved to a file. Per-domain / per-IP politeness rules are enforced, so many concurrent downloads and explorations are either waiting or in flight, sometimes up to 40 at once. As a result, I've seen the process memory footprint climb upwards of 800 MB.

About 20-40% of the time, the entire process bails out with no stack trace, after a random amount of allocated memory and running time, sometimes as little as 2 hours. My guess is that there's a bug in urllib2 or in some third-party software I'm using, or that one of them was not meant to be run in a multithreaded environment. Decreasing the bandwidth/aggressiveness of the crawler may have an effect on the frequency, but I haven't done any formal study of that yet.

My current workaround is to restart the crawler, but that is bad business for the websites (recrawling) and costs me extra crawl time. From reading about similar SIGABRT issues, I suspect this has something to do with the multithreading, and that switching to a one-download-per-process model with Pyro for IPC (to uphold the niceness rules, etc.) would fix it. But I figured I'd ask around before taking such drastic measures.

Since the process is so long-running, I have not tried running strace, and I'm not even sure its output would make sense to me or anyone else. Let me know if there's a way to capture just the last 1000 or so calls rather than the whole trace, if that would be useful.

I'm on an older Python, 2.4.4c1. Since the bug is intermittent, I'm not sure yet whether an upgrade to Python 2.5 has solved my problem.

Does anyone have any clues for me to try?

My threading code uses a messaging queue per thread, plus one notification queue that the main thread checks in order to assign new crawls back to free threads. To my knowledge, no other variables are shared between threads apart from the thread objects themselves. A rough sketch of that structure follows below.
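In case it helps, here is a stripped-down sketch of what I described above, not my actual code: next_url() stands in for my exploration code, the worker internals are simplified, and the file-saving step is elided.

    import threading
    import urllib2
    import Queue  # the stdlib queue module in Python 2.x

    notify_queue = Queue.Queue()  # main thread reads this to find free workers

    def next_url():
        # placeholder for the website-exploration code that supplies URLs
        return 'http://example.com/'

    class DownloadThread(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self.inbox = Queue.Queue()   # per-thread messaging queue
            self.setDaemon(True)

        def run(self):
            while True:
                url = self.inbox.get()
                if url is None:          # shutdown sentinel
                    break
                try:
                    data = urllib2.urlopen(url).read()  # download held in memory
                    # ... save `data` to a file here ...
                except Exception:
                    pass                 # the real code logs per-URL failures
                notify_queue.put(self)   # report back: this worker is free again

    workers = [DownloadThread() for _ in range(8)]  # I run 5-10 of these
    for w in workers:
        w.start()
        w.inbox.put(next_url())           # seed each worker with a first crawl

    while True:
        free_worker = notify_queue.get()  # block until some worker finishes
        free_worker.inbox.put(next_url())  # hand the next crawl to that thread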