Well, this is certainly possible and is an approach that is used in our ALS code. But the startup latency and the need to rescan iteration-invariant data typically induce an order-of-magnitude overhead compared to approaches specialized for distributed iterations.
Best,
Sebastian

On 08.03.2013 14:36, Marty Kube wrote:
> What about using one map reduce job per iteration? The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model. We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
>
> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>>
>>> ... Right now what we have is a
>>> single-machine procedure for scanning through some data, building a
>>> set of histograms, combining histograms and then expanding the tree.
>>> The next step is to decide the best way to distribute this. I'm not an
>>> expert here, so any advice or help here is welcome.
>>>
>> That sounds good so far.
>>
>>> I think the easiest approach would be to use the mappers to construct
>>> the set of histograms, and then send all histograms for a given leaf
>>> to a reducer, which decides how to expand that leaf. The code I have
>>> can almost be ported as-is to a mapper and reducer in this way.
>>> Would using the distributed cache to send the updated tree be wise, or
>>> is there a better way?
>>>
>> Distributed cache is a very limited thing. You can only put things in at
>> program launch and they must remain constant throughout the program's
>> run.
>>
>> The problem here is that iterated map-reduce is pretty heinously
>> inefficient.
>>
>> The best candidate approaches for avoiding that are to use a BSP sort of
>> model (see the Pregel paper at
>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>> all-reduce or the way that Google's deep learning system does.
>>
>> Running these approaches on Hadoop without Yarn or Mesos requires a
>> slight perversion of the map-reduce paradigm, but is quite doable.
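For concreteness, here is a rough, untested sketch of the mapper/reducer layout Andy describes above, assuming the Hadoop 2 mapreduce API: the mappers build per-leaf histograms in memory and emit one partial histogram per leaf, so each reduce() call sees all partials for a single leaf and can decide how to expand it. The Tree class, the fixed FEATURES x BINS layout, and the comma-separated serialization are placeholders, not Andy's actual code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LeafHistograms {
  static final int FEATURES = 10, BINS = 32;  // assumed fixed layout

  // Stand-in for the current tree: routes every instance to leaf 0 and
  // bins feature values uniformly on [0,1). In reality this would be
  // deserialized in setup() from wherever the driver shipped the model.
  static class Tree {
    int leafFor(double[] x) { return 0; }
    int binFor(int f, double v) {
      return Math.min(BINS - 1, Math.max(0, (int) (v * BINS)));
    }
  }

  // Mapper: accumulate per-leaf histograms in memory, flush one partial
  // histogram per leaf in cleanup().
  public static class HistMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Tree tree = new Tree();
    private final Map<Integer, long[]> partials =
        new HashMap<Integer, long[]>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      String[] tok = line.toString().split("\\s+");
      double[] x = new double[FEATURES];
      for (int f = 0; f < FEATURES; f++) {
        x[f] = Double.parseDouble(tok[f]);
      }
      int leaf = tree.leafFor(x);
      long[] h = partials.get(leaf);
      if (h == null) {
        partials.put(leaf, h = new long[FEATURES * BINS]);
      }
      for (int f = 0; f < FEATURES; f++) {
        h[f * BINS + tree.binFor(f, x[f])]++;
      }
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      for (Map.Entry<Integer, long[]> e : partials.entrySet()) {
        StringBuilder sb = new StringBuilder();
        for (long c : e.getValue()) sb.append(c).append(',');
        ctx.write(new IntWritable(e.getKey()), new Text(sb.toString()));
      }
    }
  }

  // Reducer: merge all partial histograms for one leaf, then decide how
  // to expand it. The split selection itself is the existing
  // single-machine code; here the merged histogram is emitted as a
  // placeholder.
  public static class HistReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable leaf, Iterable<Text> parts, Context ctx)
        throws IOException, InterruptedException {
      long[] merged = new long[FEATURES * BINS];
      for (Text p : parts) {
        String[] tok = p.toString().split(",");
        for (int i = 0; i < merged.length; i++) {
          merged[i] += Long.parseLong(tok[i]);
        }
      }
      // ... run split selection over `merged`, emit the expanded node ...
      StringBuilder sb = new StringBuilder();
      for (long c : merged) sb.append(c).append(',');
      ctx.write(leaf, new Text(sb.toString()));
    }
  }
}

Aggregating in the mapper and flushing from cleanup() keeps the shuffle down to one record per (mapper, leaf) rather than one per training instance.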

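And a driver sketch along the lines of Marty's one-job-per-iteration suggestion (again untested, Hadoop 2 Job API assumed, paths hypothetical): each round ships the previous model through the distributed cache, which is read-only for the job's lifetime exactly as Ted notes, and the reducers write the expanded model for the next round. The per-round job submission and the full rescan of the iteration-invariant training data are the fixed costs behind the overhead mentioned at the top of this mail.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TreeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path data = new Path(args[0]);               // training data
    Path model = new Path(args[1] + "/round-0"); // initial (root-only) model
    int rounds = Integer.parseInt(args[2]);

    for (int r = 1; r <= rounds; r++) {
      Job job = Job.getInstance(conf, "tree-expansion-round-" + r);
      job.setJarByClass(TreeDriver.class);
      job.setMapperClass(LeafHistograms.HistMapper.class);
      job.setReducerClass(LeafHistograms.HistReducer.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(Text.class);

      // Model from the last round; cache contents are fixed for the
      // job's lifetime, hence one job per round. (In reality this would
      // point at the concrete part file(s) the reducers wrote.)
      job.addCacheFile(new URI(model.toString()));

      FileInputFormat.setInputPaths(job, data);  // full rescan every round
      Path next = new Path(args[1] + "/round-" + r);
      FileOutputFormat.setOutputPath(job, next);

      if (!job.waitForCompletion(true)) {
        System.exit(1);          // each submission pays the per-job
      }                          // startup latency
      model = next;              // expanded model for the next round
    }
  }
}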