David Larochelle wrote:
> Currently, the driver process periodically queries a database to get
> a list of URLs to [crawl]. It then stores these url's to be
> downloaded in a complex in memory [structure?] and pipes them to
> separate processes that do the actual downloading.
>
> The problem is that the database queries are slow and block
> the driver process.

Your description leaves out a lot of details about what sort of data is
being passed back and forth between the different processing stages.

So the first stage queries a database to get a list of URLs. You then
say it stores the URLs in a complex in-memory structure. Why? Why isn't
the input to the next stage simply a URL?

If you were running a collection of crawler processes, and each had its
own code to retrieve a URL from the database, it wouldn't matter if it
blocked. Is there enough meta-data to allow independent crawler
processes to each pick a URL without overlapping with the other processes?
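
To illustrate what I mean by picking without overlap, here is a rough
sketch in Python. It assumes each crawler knows its own index and the
total number of crawlers, and that hashing the URL is an acceptable way
to split the work. The names owner_of and urls_for_worker are made up
for the example. In practice you would probably push the same test into
the database query rather than filter in memory.

```python
import hashlib


def owner_of(url, num_workers):
    """Map a URL to exactly one worker index.

    A cryptographic digest is used instead of the built-in hash() so
    that every process computes the same answer for the same URL.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def urls_for_worker(urls, worker_id, num_workers):
    """Return only the URLs that belong to this worker."""
    return [u for u in urls if owner_of(u, num_workers) == worker_id]


if __name__ == "__main__":
    # Hypothetical URL list standing in for the database contents.
    urls = ["http://example.com/page%d" % i for i in range(10)]
    for worker_id in range(3):
        print(worker_id, urls_for_worker(urls, worker_id, 3))
```

Each crawler runs the same code with a different worker_id, so a slow
query only blocks the crawler that issued it.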

Another possibility is to create some middleware. A server that queries
the database, builds a queue in memory, then accepts connections from
the crawler processes and hands out a URL from the queue to each.
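
A minimal sketch of that kind of middleware, in Python, might look like
the following. The one-URL-per-connection, newline-terminated protocol
is my own assumption, as are the names UrlServer, HandOutUrl, and
request_url. The hard-coded URL list stands in for the result of your
database query.

```python
import queue
import socket
import socketserver
import threading


class HandOutUrl(socketserver.StreamRequestHandler):
    """Answer each connection with one URL from the server's queue."""

    def handle(self):
        try:
            url = self.server.url_queue.get_nowait()
        except queue.Empty:
            url = ""  # an empty line means "nothing to crawl right now"
        self.wfile.write((url + "\n").encode("utf-8"))


class UrlServer(socketserver.ThreadingTCPServer):
    """TCP server that owns an in-memory queue of URLs."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address, urls):
        super().__init__(address, HandOutUrl)
        self.url_queue = queue.Queue()
        for url in urls:
            self.url_queue.put(url)


def request_url(address):
    """Crawler side: connect, read one line, return the URL (or "")."""
    with socket.create_connection(address) as conn:
        with conn.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


if __name__ == "__main__":
    # Hypothetical stand-in for the list returned by the database query.
    urls = ["http://example.com/page%d" % i for i in range(3)]
    server = UrlServer(("127.0.0.1", 0), urls)  # port 0: let the OS pick
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(request_url(server.server_address))
    server.shutdown()
    server.server_close()
```

The server can refill its queue from the database on its own schedule,
so the crawlers never wait on the query itself.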

Without knowing what sort of data you are exchanging, and how
frequently, I can't say whether in-memory IPC and threads are
good/necessary solutions, or if you'd be better off just running a bunch
of independent processes.

It's hard to say from what you've described so far, but this is sounding
like a map-reduce problem. If you followed that algorithm, the first
stage builds the list of URLs to crawl. The second stage spawns a pile
of children to crawl the URLs individually or in batches, and produces
intermediary results. The final stage aggregates the results.
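
Those three stages can be sketched in a few lines of Python. Everything
here is a placeholder: build_url_list stands in for the database query,
crawl returns a made-up intermediate result instead of downloading
anything, and aggregate just counts pages per host. Threads are used to
keep the example short. ProcessPoolExecutor offers the same map
interface if you want real child processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def build_url_list():
    """Stage 1: hypothetical stand-in for the database query."""
    return [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
    ]


def crawl(url):
    """Stage 2 (map): placeholder for the download.

    Returns an intermediate (host, count) pair instead of fetching
    the page.
    """
    return (urlparse(url).netloc, 1)


def aggregate(results):
    """Stage 3 (reduce): total the intermediate results per host."""
    totals = Counter()
    for host, count in results:
        totals[host] += count
    return dict(totals)


def run(urls, workers=4):
    """Fan the URLs out to a pool of workers, then aggregate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return aggregate(pool.map(crawl, urls))


if __name__ == "__main__":
    print(run(build_url_list()))
```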

There are some existing modules and tools you could use to implement
this, Hadoop for example:
http://en.wikipedia.org/wiki/Map_reduce
http://search.cpan.org/~drrho/Parallel-MapReduce-0.09/lib/Parallel/MapReduce.pm
http://www.slideshare.net/philwhln/map-reduce-using-perl
http://hadoop.apache.org/

Whatever you choose for IPC, I'd give consideration to how that
mechanism could be used across a cluster of machines, so you have the
option to scale up.

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
