I'm not sure, but in this case I don't think we have much
iteration-invariant data.
On 03/08/2013 08:39 AM, Sebastian Schelter wrote:
Well, this is certainly possible and is an approach used in our ALS
code. But the startup latency and the need to rescan
iteration-invariant data typically add an order-of-magnitude overhead
compared to approaches specialized for distributed iterations.
Best,
Sebastian
On 08.03.2013 14:36, Marty Kube wrote:
What about using one map-reduce job per iteration? The model you load
into the distributed cache is the model from the previous round, and the
reducer can emit the expanded model. We are presumably working with
large data sets, so I would not expect start-up latency to be an issue.
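A driver loop along those lines might look like the sketch below.
IterativeTreeDriver, TreeHistogramMapper, LeafExpandReducer, and
Histogram are all hypothetical names for illustration (partial sketches
of the mapper and reducer appear further down the thread), not anything
that exists yet.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeTreeDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path model = new Path(args[1]); // model emitted by the previous round
    for (int round = 0; round < 10; round++) {
      Configuration conf = new Configuration();
      // Ship the previous round's model to every mapper.
      DistributedCache.addCacheFile(new URI(model.toString()), conf);
      Job job = new Job(conf, "expand-tree-" + round);
      job.setJarByClass(IterativeTreeDriver.class);
      job.setMapperClass(TreeHistogramMapper.class); // hypothetical
      job.setReducerClass(LeafExpandReducer.class);  // hypothetical
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Histogram.class);   // hypothetical Writable
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      Path next = new Path(model.getParent(), "model-" + round);
      FileOutputFormat.setOutputPath(job, next);
      if (!job.waitForCompletion(true)) {
        break;
      }
      model = next; // the reducer's output becomes the next round's model
    }
  }
}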
On 03/07/2013 04:56 PM, Ted Dunning wrote:
On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
... Right now what we have is a
single-machine procedure for scanning through some data, building a
set of histograms, combining histograms, and then expanding the tree.
The next step is to decide the best way to distribute this. I'm not an
expert here, so any advice or help is welcome.
That sounds good so far.
I think the easiest approach would be to use the mappers to construct
the set of histograms, and then send all histograms for a given leaf
to a reducer, which decides how to expand that leaf. The code I have
can almost be ported as-is to a mapper and reducer in this way.
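Roughly this shape, in other words. Histogram, Tree, and the helper
methods (readFromCache, parse, routeToLeaf, merge, bestSplit) below are
made-up stand-ins, not the actual single-machine code:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TreeHistogramMapper
    extends Mapper<LongWritable, Text, IntWritable, Histogram> {
  private Tree tree; // current model, read from the distributed cache
  private final Map<Integer, Histogram> perLeaf =
      new HashMap<Integer, Histogram>();

  @Override
  protected void setup(Context ctx) throws IOException {
    tree = Tree.readFromCache(ctx.getConfiguration()); // hypothetical loader
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx) {
    double[] features = Tree.parse(record.toString()); // hypothetical parser
    int leaf = tree.routeToLeaf(features);
    Histogram h = perLeaf.get(leaf);
    if (h == null) {
      perLeaf.put(leaf, h = new Histogram());
    }
    h.add(features); // in-mapper combining keeps the shuffle small
  }

  @Override
  protected void cleanup(Context ctx)
      throws IOException, InterruptedException {
    // Emit one partial histogram per leaf seen by this mapper, keyed by
    // leaf id so all histograms for a leaf meet at one reducer.
    for (Map.Entry<Integer, Histogram> e : perLeaf.entrySet()) {
      ctx.write(new IntWritable(e.getKey()), e.getValue());
    }
  }
}

// A separate file in a real project.
class LeafExpandReducer
    extends Reducer<IntWritable, Histogram, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable leaf, Iterable<Histogram> parts,
      Context ctx) throws IOException, InterruptedException {
    Histogram merged = new Histogram();
    for (Histogram part : parts) {
      merged.merge(part); // combine the partial histograms for this leaf
    }
    // Pick the best split for this leaf; the driver folds the emitted
    // splits into the next round's tree.
    ctx.write(leaf, new Text(merged.bestSplit().toString()));
  }
}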
Would using the distributed cache to send the updated tree be wise, or
is there a better way?
Distributed cache is a very limited thing. You can only put things in at
program launch and they must remain constant throughout the program's
run.
The problem here is that iterated map-reduce is pretty heinously
inefficient.
The best candidate approaches for avoiding that are to use a BSP sort of
model (see the Pregel paper at
http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
unsynchronized model update cycle the way that Vowpal Wabbit does with
all-reduce or the way that Google's deep learning system does.
Running these approaches on Hadoop without Yarn or Mesos requires a
slight
perversion of the map-reduce paradigm, but is quite doable.
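For what it's worth, here is a toy, single-process illustration of the
superstep structure that BSP gives you: each worker keeps its shard of
the data in memory and scans it once per superstep, and a barrier action
plays the role of the combine/allreduce step, so iteration-invariant
data is never rescanned from disk. This is not Pregel or VW code; the
class and the scalar "model" are invented purely to show the pattern.

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

public class BspSketch {
  static final int WORKERS = 4;
  static final int SUPERSTEPS = 5;
  static final long[] model = new long[1]; // toy shared model
  static final AtomicLong partial = new AtomicLong();
  static final CyclicBarrier barrier =
      new CyclicBarrier(WORKERS, new Runnable() {
        @Override
        public void run() {
          // Runs once per superstep after every worker arrives: the
          // combine-and-update step a reducer or allreduce would perform.
          model[0] += partial.getAndSet(0);
        }
      });

  public static void main(String[] args) throws Exception {
    Thread[] threads = new Thread[WORKERS];
    for (int w = 0; w < WORKERS; w++) {
      final int shard = w;
      threads[w] = new Thread(new Runnable() {
        @Override
        public void run() {
          try {
            for (int step = 0; step < SUPERSTEPS; step++) {
              partial.addAndGet(shard + 1); // stand-in for a local data scan
              barrier.await();              // superstep boundary
            }
          } catch (Exception e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      threads[w].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("model after " + SUPERSTEPS
        + " supersteps: " + model[0]);
  }
}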