You could easily write Cascading applications that pull all the data into a 
single location and process it there.

You could also use it to launch jobs on different clusters from a single 
application (each Flow can be given its own properties, causing it to run its 
MapReduce jobs on an arbitrary cluster). 
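
As a rough sketch of what "unique properties" per Flow could look like: the 
plain-Java snippet below builds one java.util.Properties object per cluster. 
It assumes the Hadoop property names of that era (fs.default.name and 
mapred.job.tracker); the class name, region labels, and example.com host names 
are made up for illustration. In a Cascading application each Properties 
object would presumably be handed to its own FlowConnector, but that part is 
left as a comment so the snippet compiles without Cascading on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: not part of Cascading or Hadoop.
final class ClusterFlowProperties {

    // Build the Hadoop properties that point a job at one specific cluster.
    // Property names assume a Hadoop 0.20-style configuration.
    static Properties forCluster(String nameNodeUri, String jobTracker) {
        Properties props = new Properties();
        props.setProperty("fs.default.name", nameNodeUri);
        props.setProperty("mapred.job.tracker", jobTracker);
        return props;
    }

    // One entry per geographic cluster; host names are placeholders.
    static Map<String, Properties> regionalClusters() {
        Map<String, Properties> clusters = new LinkedHashMap<String, Properties>();
        clusters.put("us-east",
            forCluster("hdfs://nn.us-east.example.com:8020", "jt.us-east.example.com:8021"));
        clusters.put("eu-west",
            forCluster("hdfs://nn.eu-west.example.com:8020", "jt.eu-west.example.com:8021"));
        return clusters;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Properties> entry : regionalClusters().entrySet()) {
            // Assumption: in a real application, entry.getValue() would be passed
            // to a separate FlowConnector so that Flow runs on this cluster.
            System.out.println(entry.getKey() + " -> "
                + entry.getValue().getProperty("mapred.job.tracker"));
        }
    }
}
```

Keeping the per-cluster settings in one map means the application can loop 
over the clusters and build one Flow for each, rather than hard-coding 
cluster addresses inside the job logic.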

So you can effectively run the number crunching remotely on each independent 
cluster, pull the results down to a single cluster, and then load them into 
any backend systems. Cascading can coordinate the scheduling of the Flows 
across clusters (via the Cascade abstraction).

ckw

On Nov 3, 2010, at 12:18 PM, Jason Smith wrote:

> I am looking into the problem of running jobs to generate statistics across
> a large data set that would be split into different clusters
> geographically.  Each cluster would have a unique piece of the overall data
> set, as the network overhead to collocate the data would be too much. I
> tried searching around for any tools that might help orchestrate something
> like this, but did not find anything. Are there any tools I'm missing that I
> should look into?
> 
> Thanks
> Jason

--
Chris K Wensel
[email protected]
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
