Well, my general experience is that a lot of datasets used for iterative
computations are not that large once feature extraction is done. Hadoop
includes a lot of design decisions that only make sense when you scale
out to really large setups. I'm not convinced that machine learning
(except maybe at Google or Facebook) really falls into this category.

If you think of graph mining or collaborative filtering, even datasets
with a few billion data points need only a few dozen gigabytes and
easily fit into the aggregate main memory of a small cluster.
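
As a back-of-envelope check (my own numbers, assuming edges stored as
pairs of 32-bit vertex ids):

```python
# Rough memory estimate for a graph dataset -- my own illustration,
# assuming each edge is a pair of 32-bit vertex ids (8 bytes total).
edges = 2_000_000_000        # 2B edges
bytes_per_edge = 8           # two 4-byte ints
total_gb = edges * bytes_per_edge / 1e9
print(f"{total_gb:.0f} GB")  # -> 16 GB, well within a small cluster's RAM
```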

For example, in some recent experiments, I was able to run one
iteration of PageRank on a dataset with 1B edges in roughly 20 seconds
on a small 26-node cluster using the Stratosphere [1] system. The
authors of GraphLab report similar numbers in recent papers [2].
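
To make the per-iteration work concrete, here is a minimal
single-machine sketch of one PageRank step over an edge list (my own
illustration; it is not Stratosphere's actual distributed dataflow):

```python
# One PageRank iteration over an edge list -- a toy, single-machine
# sketch of the per-iteration computation, not Stratosphere code.
from collections import defaultdict

def pagerank_step(edges, ranks, out_degree, damping=0.85):
    # Each vertex spreads its current rank evenly over its out-edges;
    # the damping term redistributes a little rank to every vertex.
    contrib = defaultdict(float)
    for src, dst in edges:
        contrib[dst] += ranks[src] / out_degree[src]
    n = len(ranks)
    return {v: (1 - damping) / n + damping * contrib[v] for v in ranks}

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
out_degree = {0: 2, 1: 1, 2: 1}
ranks = {v: 1 / 3 for v in range(3)}
ranks = pagerank_step(edges, ranks, out_degree)
assert abs(sum(ranks.values()) - 1.0) < 1e-9  # rank mass is conserved
```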

I'm not sure that you can get Hadoop anywhere near that performance on a
similar setup.

[1] https://www.stratosphere.eu/
[2] http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

On 08.03.2013 15:03, Sean Owen wrote:
> Weighing in late here -- it's an interesting question whether M/R is a good
> fit for iterative processes. Today you can already optimize away most of
> the startup overhead for a job, by setting Hadoop to reuse JVMs, increasing
> heartbeat rate, etc. I know Ted can add more about how MapR tunes that
> further. So I'm not sure it's the overhead of starting jobs per se.
> 
> Unless your iterations are < 1 minute or something. And they may be in some
> situations. I don't think that's true of, say, ALS or even the RF
> implementation I have in mind. Iterations may be really fast at small scale
> too, but that's not what Hadoop is for. Or unless your iterations have
> significant init overhead, like loading data. That's a good point too.
> 
> 
> I think the problem comes if the process is not naturally iterative -- if
> it's parallel but the workers need not stop to sync up, then forcing them
> into an iterative process just wastes time. Most of the time you're waiting
> for a straggler worker, needlessly. In this regard, I am not sure that a BSP
> paradigm is any better? But not sure anyone was pushing BSP.
> 
> 
> But I think RF could fit an iterative scheme well, although (a) I haven't
> thought it through completely or tried it, and (b) I can imagine reasons it
> may work in theory but not in practice. But is this not roughly how you'd
> do it -- maybe this is what's already being suggested? Roughly:
> 
> 
> Break up the data by feature. (Subdivide features if needed.) Map over all
> the data and distribute the single-feature data. Reducers compute an
> optimal split / decision rule and output their rules. Next iteration:
> mappers read the rules and choose a next random feature for each rule. They
> map the data, apply each rule to each record, apply the next choice of
> feature, and output the single feature again. Reducers will receive, in one
> input group, again all the data they need to compute another split. They
> output that split. Repeat.
> 
> There's a lot of devil in the details there. But this seems like a fine way
> to build a tree level by level without sending all the data all over the
> place. I suppose the problem is uneven data distribution, and that some
> decision trees will go deep. But quickly your number of input groups gets
> quite large as they get smaller, so it ought to even out over reducers
> (?).
> 
> 
> 
> To answer Marty, I don't think this project will ever change much from what
> it is now. It's not even properly on Hadoop 0.20.x, much less 2.x. An
> MRv2-based project is a different project, as it would properly be a total
> rewrite. Something to start thinking about: drawing a line under what's
> here, IMHO.
> 
> 
> On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <
> [email protected]> wrote:
> 
>> What about using one map-reduce job per iteration?  The model you load
>> into the distributed cache is the model from the last round, and the
>> reducer can emit the expanded model.  We are presumably working with
>> large data sets, so I would not expect start-up latency to be an issue.
>>
>>
> 
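
The per-level scheme Sean sketches above can be simulated in a single
process. All the names and the crude split rule below are my own; real
Hadoop mappers and reducers would replace the in-memory shuffle:

```python
# Single-process sketch of building one tree level: mappers emit
# (node_id, (feature_value, label)); each reduce group computes a split
# for one node. The midpoint-of-class-means split is a deliberately
# crude stand-in for a real impurity-based search.
from collections import defaultdict

def map_phase(records, assigned_feature):
    # records: (node_id, feature_dict, label); emit only the single
    # feature assigned to this level, keyed by the node (input group).
    for node_id, feats, label in records:
        yield node_id, (feats[assigned_feature], label)

def reduce_phase(groups):
    splits = {}
    for node_id, values in groups.items():
        pos = [v for v, y in values if y == 1]
        neg = [v for v, y in values if y == 0]
        # Assumes both classes are present in the group (toy data).
        splits[node_id] = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return splits

records = [(0, {"x": 1.0}, 0), (0, {"x": 2.0}, 0),
           (0, {"x": 8.0}, 1), (0, {"x": 9.0}, 1)]
groups = defaultdict(list)          # stands in for the shuffle
for node_id, kv in map_phase(records, "x"):
    groups[node_id].append(kv)
print(reduce_phase(groups))         # -> {0: 5.0}
```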
