David Larochelle wrote:
> Currently, the driver process periodically queries a database to get
> a list of URLs to [crawl]. It then stores these url's to be
> downloaded in a complex in memory [structure?] and pipes them to
> separate processes that do the actual downloading.
>
> The problem is that the database queries are slow and block
> the driver process.

Your description leaves out a lot of details about what sort of data is
being passed back and forth between the different processing stages.

So the first stage is querying a database to get a list of URLs. You
then say it stores the URLs in a complex in-memory structure? Why? Why
isn't the input to the next stage simply a URL?

If you were running a collection of crawler processes, and each had its
own code to retrieve a URL from the database, it wouldn't matter if it
blocked. Is there enough meta-data to allow independent crawler
processes to each pick a URL without overlapping with the other processes?

Another possibility is to create some middleware: a server that queries
the database, builds a queue in memory, then accepts connections from
the crawler processes and hands out a URL from the queue to each.

Without knowing what sort of data you are exchanging, and how frequently,
I can't say whether in-memory IPC and threads are good/necessary
solutions, or if you'd be better off just running a bunch of independent
processes.

It's hard to say from what you've described so far, but this is sounding
like a map-reduce problem. If you followed that algorithm, the first
stage builds the list of URLs to crawl. The second stage spawns a pile
of children to crawl the URLs individually or in batches, and produces
intermediary results. The final stage aggregates the results.

(There are some existing modules and tools you could use to implement
this. Hadoop, for example.
http://en.wikipedia.org/wiki/Map_reduce
http://search.cpan.org/~drrho/Parallel-MapReduce-0.09/lib/Parallel/MapReduce.pm
http://www.slideshare.net/philwhln/map-reduce-using-perl
http://hadoop.apache.org/
)

Whatever you choose for IPC, I'd give consideration to how that
mechanism could be used across a cluster of machines, so you have the
option to scale up.

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm
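
As a rough illustration of the "independent crawler processes each pick
a URL without overlapping" idea from the message above, here is a
minimal sketch in Python with SQLite. Everything in it is an assumption
made up for the example: the `urls` table, its `claimed_by` column, and
the function names say nothing about the real schema or database. The
point is only that if each row records who claimed it, every crawler can
fetch its own work and a slow query blocks nobody but that crawler.

```python
import sqlite3


def setup_demo_db(conn, urls):
    """Create a hypothetical work table and load it with URLs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " claimed_by TEXT)"  # NULL means nobody has taken this URL yet
    )
    conn.executemany("INSERT INTO urls (url) VALUES (?)", [(u,) for u in urls])


def claim_next_url(conn, worker_id):
    """Claim one unclaimed URL for worker_id; return it, or None if none left.

    Assumes conn was opened with isolation_level=None so that the
    transaction is controlled explicitly here.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # cannot both select the same unclaimed row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, url FROM urls"
            " WHERE claimed_by IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE urls SET claimed_by = ? WHERE id = ?",
                (worker_id, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[1]


if __name__ == "__main__":
    # Each crawler process would open its own connection to a shared
    # database file; an in-memory database is used here only as a demo.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    setup_demo_db(conn, ["http://example.com/a", "http://example.com/b"])
    while (url := claim_next_url(conn, "worker-1")) is not None:
        print("would crawl", url)
```

With a server database the same claim step would normally be a single
locking statement in that database's own dialect; the structure (claim,
then crawl, then loop) stays the same.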

