Well, this is certainly possible and is an approach that is used in our ALS code. But the startup latency and the need to rescan iteration-invariant data typically induce an order-of-magnitude overhead compared to approaches specialized for distributed iterations.
Best,
Sebastian

On 08.03.2013 14:36, Marty Kube wrote:
> What about using one map reduce job per iteration? The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model. We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
>
> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>>
>>> ... Right now what we have is a
>>> single-machine procedure for scanning through some data, building a
>>> set of histograms, combining histograms and then expanding the tree.
>>> The next step is to decide the best way to distribute this. I'm not an
>>> expert here, so any advice or help here is welcome.
>>>
>> That sounds good so far.
>>
>>> I think the easiest approach would be to use the mappers to construct
>>> the set of histograms, and then send all histograms for a given leaf
>>> to a reducer, which decides how to expand that leaf. The code I have
>>> can almost be ported as-is to a mapper and reducer in this way.
>>> Would using the distributed cache to send the updated tree be wise, or
>>> is there a better way?
>>>
>> Distributed cache is a very limited thing. You can only put things in at
>> program launch and they must remain constant throughout the program's
>> run.
>>
>> The problem here is that iterated map-reduce is pretty heinously
>> inefficient.
>>
>> The best candidate approaches for avoiding that are to use a BSP sort of
>> model (see the Pregel paper at
>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>> all-reduce or the way that Google's deep learning system does.
>>
>> Running these approaches on Hadoop without Yarn or Mesos requires a
>> slight perversion of the map-reduce paradigm, but is quite doable.
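For concreteness, here is a rough, untested sketch of the mapper/reducer layout Andy describes above, assuming the Hadoop 2 mapreduce API: the mappers build per-leaf histograms in memory and emit one partial histogram per leaf, so each reduce() call sees all partials for a single leaf and can decide how to expand it. The Tree class, the fixed FEATURES x BINS layout, and the comma-separated serialization are placeholders, not Andy's actual code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LeafHistograms {
  static final int FEATURES = 10, BINS = 32;  // assumed fixed layout

  // Stand-in for the current tree: routes every instance to leaf 0 and
  // bins feature values uniformly on [0,1). In reality this would be
  // deserialized in setup() from wherever the driver shipped the model.
  static class Tree {
    int leafFor(double[] x) { return 0; }
    int binFor(int f, double v) {
      return Math.min(BINS - 1, Math.max(0, (int) (v * BINS)));
    }
  }

  // Mapper: accumulate per-leaf histograms in memory, flush one partial
  // histogram per leaf in cleanup().
  public static class HistMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Tree tree = new Tree();
    private final Map<Integer, long[]> partials =
        new HashMap<Integer, long[]>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      String[] tok = line.toString().split("\\s+");
      double[] x = new double[FEATURES];
      for (int f = 0; f < FEATURES; f++) {
        x[f] = Double.parseDouble(tok[f]);
      }
      int leaf = tree.leafFor(x);
      long[] h = partials.get(leaf);
      if (h == null) {
        partials.put(leaf, h = new long[FEATURES * BINS]);
      }
      for (int f = 0; f < FEATURES; f++) {
        h[f * BINS + tree.binFor(f, x[f])]++;
      }
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      for (Map.Entry<Integer, long[]> e : partials.entrySet()) {
        StringBuilder sb = new StringBuilder();
        for (long c : e.getValue()) sb.append(c).append(',');
        ctx.write(new IntWritable(e.getKey()), new Text(sb.toString()));
      }
    }
  }

  // Reducer: merge all partial histograms for one leaf, then decide how
  // to expand it. The split selection itself is the existing
  // single-machine code; here the merged histogram is emitted as a
  // placeholder.
  public static class HistReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable leaf, Iterable<Text> parts, Context ctx)
        throws IOException, InterruptedException {
      long[] merged = new long[FEATURES * BINS];
      for (Text p : parts) {
        String[] tok = p.toString().split(",");
        for (int i = 0; i < merged.length; i++) {
          merged[i] += Long.parseLong(tok[i]);
        }
      }
      // ... run split selection over `merged`, emit the expanded node ...
      StringBuilder sb = new StringBuilder();
      for (long c : merged) sb.append(c).append(',');
      ctx.write(leaf, new Text(sb.toString()));
    }
  }
}

Aggregating in the mapper and flushing from cleanup() keeps the shuffle down to one record per (mapper, leaf) rather than one per training instance.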

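And a driver sketch along the lines of Marty's one-job-per-iteration suggestion (again untested, Hadoop 2 Job API assumed, paths hypothetical): each round ships the previous model through the distributed cache, which is read-only for the job's lifetime exactly as Ted notes, and the reducers write the expanded model for the next round. The per-round job submission and the full rescan of the iteration-invariant training data are the fixed costs behind the overhead mentioned at the top of this mail.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TreeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path data = new Path(args[0]);               // training data
    Path model = new Path(args[1] + "/round-0"); // initial (root-only) model
    int rounds = Integer.parseInt(args[2]);

    for (int r = 1; r <= rounds; r++) {
      Job job = Job.getInstance(conf, "tree-expansion-round-" + r);
      job.setJarByClass(TreeDriver.class);
      job.setMapperClass(LeafHistograms.HistMapper.class);
      job.setReducerClass(LeafHistograms.HistReducer.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(Text.class);

      // Model from the last round; cache contents are fixed for the
      // job's lifetime, hence one job per round. (In reality this would
      // point at the concrete part file(s) the reducers wrote.)
      job.addCacheFile(new URI(model.toString()));

      FileInputFormat.setInputPaths(job, data);  // full rescan every round
      Path next = new Path(args[1] + "/round-" + r);
      FileOutputFormat.setOutputPath(job, next);

      if (!job.waitForCompletion(true)) {
        System.exit(1);          // each submission pays the per-job
      }                          // startup latency
      model = next;              // expanded model for the next round
    }
  }
}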