See Giraph.

On Thu, Mar 7, 2013 at 6:01 PM, Andy Twigg <[email protected]> wrote:
> That sounds like a horrid amount of work to do something simple. Is there a
> hadoop implementation of a master-workers problem you can point me to?
>
> On Mar 7, 2013 9:57 PM, "Ted Dunning" <[email protected]> wrote:
>
> > On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
> >
> > > ... Right now what we have is a single-machine procedure for scanning
> > > through some data, building a set of histograms, combining histograms,
> > > and then expanding the tree. The next step is to decide the best way to
> > > distribute this. I'm not an expert here, so any advice or help here is
> > > welcome.
> >
> > That sounds good so far.
> >
> > > I think the easiest approach would be to use the mappers to construct
> > > the set of histograms, and then send all histograms for a given leaf to
> > > a reducer, which decides how to expand that leaf. The code I have can
> > > almost be ported as-is to a mapper and reducer in this way. Would using
> > > the distributed cache to send the updated tree be wise, or is there a
> > > better way?
> >
> > Distributed cache is a very limited thing. You can only put things in at
> > program launch, and they must remain constant throughout the program's
> > run.
> >
> > The problem here is that iterated map-reduce is pretty heinously
> > inefficient.
> >
> > The best candidate approaches for avoiding that are to use a BSP sort of
> > model (see the Pregel paper at
> > http://kowshik.github.com/JPregel/pregel_paper.pdf ) or to use an
> > unsynchronized model update cycle the way that Vowpal Wabbit does with
> > all-reduce or the way that Google's deep learning system does.
> >
> > Running these approaches on Hadoop without Yarn or Mesos requires a
> > slight perversion of the map-reduce paradigm, but it is quite doable.
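For readers following the thread: the mapper/reducer split Andy describes (mappers build per-leaf histograms from their data partition; a reducer merges all histograms for a leaf and decides how to expand it) can be sketched in a few lines of single-process Python. This is only an illustrative sketch, not code from the thread; the function names (`mapper`, `reducer`, `tree_predict`) and the crude fixed-width bucketing are assumptions made here for clarity.

```python
from collections import Counter, defaultdict

def mapper(records, tree_predict):
    """Map phase (sketch): route each (feature, label) record to the leaf
    the current tree assigns it to, and count bucketed feature values there.
    `tree_predict` is a hypothetical stand-in for the current tree."""
    hists = defaultdict(Counter)
    for x, y in records:
        leaf = tree_predict(x)
        # Bucket the feature value coarsely so each histogram stays small
        # enough to ship to a reducer.
        hists[leaf][(round(x, 1), y)] += 1
    # Emit (leaf_id, histogram) pairs, shuffled by leaf_id.
    return list(hists.items())

def reducer(leaf, hist_list):
    """Reduce phase (sketch): merge all partial histograms for one leaf.
    A real implementation would then pick the best split from `merged`."""
    merged = Counter()
    for h in hist_list:
        merged.update(h)
    return leaf, merged
```

Note this sketch still leaves open the question raised in the thread: after the reducer expands a leaf, the updated tree has to reach the next round of mappers somehow, which is exactly where distributed cache falls short and why BSP-style or all-reduce-style approaches come up.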
