You could easily write Cascading apps that pull all the data into a single source and perform the processing there.
You could also use it to launch jobs on different clusters from a single application: each Flow can be given its own properties, causing it to run its MapReduce jobs on an arbitrary cluster. So you can effectively run the number crunching remotely on each independent cluster, then have the results pulled down to a single cluster and loaded into any backend systems. Cascading can coordinate the scheduling of the Flows across clusters (via the Cascade abstraction).

ckw

On Nov 3, 2010, at 12:18 PM, Jason Smith wrote:

> I am looking into the problem of running jobs to generate statistics across
> a large data set that would be split into different clusters
> geographically. Each cluster would have a unique piece of the overall data
> set, as the network overhead to collocate the data would be too much. I
> tried searching around for any tools that might help orchestrate something
> like this, but did not find anything. Are there any tools I'm missing that I
> should look into?
>
> Thanks
> Jason

--
Chris K Wensel
[email protected]
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
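As a rough illustration of the "unique properties per Flow" idea in the reply above, the sketch below builds one set of Hadoop connection properties per cluster. The keys `fs.default.name` and `mapred.job.tracker` are the standard Hadoop settings of that era; the class name, cluster labels, and host names are hypothetical. How the properties reach Cascading is an assumption here: each set would presumably be handed to the connector that plans the Flow for that cluster, which this sketch does not do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: one Properties object per remote cluster.
class ClusterFlowProperties {

    // Builds the Hadoop connection properties that point a job at one cluster.
    static Properties forCluster(String fileSystemUri, String jobTracker) {
        Properties props = new Properties();
        props.setProperty("fs.default.name", fileSystemUri);   // which HDFS to read from and write to
        props.setProperty("mapred.job.tracker", jobTracker);   // which JobTracker runs the MapReduce jobs
        return props;
    }

    // Example set of clusters; labels and host names are made up for illustration.
    static Map<String, Properties> exampleClusters() {
        Map<String, Properties> clusters = new LinkedHashMap<>();
        clusters.put("us-east", forCluster(
            "hdfs://namenode.us-east.example.com:8020",
            "jobtracker.us-east.example.com:8021"));
        clusters.put("eu-west", forCluster(
            "hdfs://namenode.eu-west.example.com:8020",
            "jobtracker.eu-west.example.com:8021"));
        return clusters;
    }

    public static void main(String[] args) {
        // In a real application, each Properties object would configure the Flow
        // for its cluster, and a Cascade would schedule the Flows together.
        for (Map.Entry<String, Properties> entry : exampleClusters().entrySet()) {
            System.out.println(entry.getKey() + " -> "
                + entry.getValue().getProperty("mapred.job.tracker"));
        }
    }
}
```

Keeping the per-cluster settings in separate objects means the same pipe assembly can be planned once per cluster, with only the connection details changing.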
