I am looking into running jobs that generate statistics across a large data set split among geographically separate clusters. Each cluster would hold a unique piece of the overall data set, because the network overhead of colocating the data would be too high. I searched for tools that could help orchestrate something like this but did not find anything. Are there any tools I'm missing that I should look into?
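To make the setup concrete, here is a rough sketch of the kind of job I mean. It assumes the statistics are mergeable (count, mean, variance), so each cluster can compute a small partial summary locally and only those summaries cross the network. The names `Partial`, `summarize`, and `merge` are hypothetical, made up for illustration, and do not come from any existing tool.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Partial:
    """Small per-cluster summary: the only thing sent over the network."""
    n: int        # number of values seen
    mean: float   # running mean
    m2: float     # sum of squared deviations from the mean


def summarize(values: Iterable[float]) -> Partial:
    """Run locally on one cluster (Welford's single-pass update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Partial(n, mean, m2)


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two summaries without access to the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


# Hypothetical data: each list stands in for one cluster's local slice.
clusters = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, (summarize(c) for c in clusters), Partial(0, 0.0, 0.0))
population_variance = total.m2 / total.n
```

The arithmetic is the easy part. What I am missing is the orchestration: scheduling `summarize` on each cluster and collecting the results for the final merge.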
Thanks, Jason
