On Sun, Jul 13, 2014 at 4:22 PM, Ted Dunning <[email protected]> wrote:
> I have a program that I am trying to build that has this pattern:
>
> broadcast state to all blocks
> block map to do a bit of computation, create local state
> merge all of the local states back to the global state
> repeat
>
> What is the suggestion for merging the local state back to the global
> state?
>

That would be one of the half a dozen or so distinct shuffle types in Spark, the one they call "reduce".

One of the realizations I had was that the only invariant part of all the shuffle types a distributed engine can currently offer is map, which was realized in a custom operator, mapBlock(). That was (almost) the only customization that this work was willing to offer. Being an R-like algebraic thing first, we don't offer anything beyond that.

You can still apply any of the Spark shuffle types if you go down to acquire a checkpoint's `rdd` and then work with full awareness of the Spark environment after that. It would be very easy to accomplish what you describe that way.

Doing distributed abstractions probably leans more towards data frame work than algebra. MLI has kind of done just that -- it's just that I am still not sure I see much more there than an attempt to translate Spark into yet another flavor of `Spark II` -- I am likely wrong about it; there has to be so much more to MLI.
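
For illustration only, here is a minimal sketch of the broadcast / block map / merge / repeat loop from the quoted question, written in plain Python rather than against the Spark or Mahout APIs. Everything in it is an assumption made for the example: the names `local_update`, `merge`, `fit` and `blocks` are hypothetical, the "global state" is a single slope w fitted by gradient descent, and the list of blocks stands in for a partitioned dataset. On a real engine the per-block step would run on the workers and the merge would be the engine's reduce-style operation.

```python
from functools import reduce


def local_update(block, w):
    """Block 'map' step: build local state from one block and the shared w.

    The local state is (sum of squared-error gradients for y ~ w * x, row count).
    """
    grad = sum(2.0 * (w * x - y) * x for x, y in block)
    return grad, len(block)


def merge(a, b):
    """Merge two local states; associative and commutative, so order is free."""
    return a[0] + b[0], a[1] + b[1]


def fit(blocks, w=0.0, lr=0.05, iterations=50):
    """Repeat: share w with every block, map each block, merge, update w."""
    for _ in range(iterations):
        # "Broadcast": every block reads the same w and never modifies it.
        local_states = [local_update(block, w) for block in blocks]
        # "Merge": fold all local states into one global gradient and count.
        grad_sum, n = reduce(merge, local_states)
        # Update the global state that the next round will share.
        w -= lr * grad_sum / n
    return w


# Three blocks of (x, y) rows, all lying on the line y = 3x.
blocks = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(4.0, 12.0), (5.0, 15.0)],
]
w = fit(blocks)
```

The property that matters is that `merge` is associative and commutative, so the engine can combine local states in any order or grouping. With that in place the merge step fits a reduce-style operation, and the driver only ever holds the small merged state.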
