Re: Out-of-core random forest implementation

Sean Owen Sat, 09 Mar 2013 01:16:47 -0800

On Fri, Mar 8, 2013 at 11:54 PM, Ted Dunning <[email protected]> wrote:


> Right on both.  Serializing isn't much of the issue.  It is the disk and
> the hard checkpointing.
>
> Well, with k-means, Hadoop map-reduce implementations are at least an order
> of magnitude slower than, say, a Shark or Graphlab implementation.  The
> Shark implementation is pretty much one-for-one and the graphlab version
> allows asynchronous updates to centroids.
>

The pattern here seems to be 'checkpointing' as the culprit -- that's well
said. In an iterative M/R process you are necessarily writing a lot of
state to HDFS just to read it back again. Whereas the alternatives
mentioned here are based on long-lived workers holding on to data over long
periods. Some checkpointing is necessary to be robust to failure over hours
of work, but I assume (?) these accomplish this in a lighter-weight way.

Hmm, my long-standing assumption/experience had been that the I/O wasn't a
big part of the run-time. But I'm working on a particular set of tasks that
jumps through hoops to avoid I/O. So if it runs for 10 minutes to write out
100MB of data, no big deal. At smaller scales, for different distributions,
and for different algorithms -- not necessarily true.

My disposition has been to take M/R as a given for now, because in practice
it's so widely available, and figure out what works well within those
constraints. I feel more compelled than ever to optimize away I/O, but it
seems to work just fine for certain approaches, done with care, even when
iterative.

But what do you think of my distinction above? personally that would be a
bright line that I'm looking for to conclude that big-learning-style
problems ought to be moving at last in 2013 to a different paradigm. I
hadn't quite had that before for BSP or graph-oriented or other paradigms.



> If your data fits in cluster memory and you aren't running a caching
> implementation, it definitely increases disk I/O.
>

I was going to say that fitting in cluster memory seems like a non-trivial
condition -- but for a classifier problem, and for even quite large data
sets, probably not. I'm interested in avoiding this condition though, if
the price is only "moderate".


>
> If your data doesn't fit in memory you get a kinda not scalable
> implementation.  You have to pass over your data a number of times roughly
> proportional to the depth of your tree.  Your tree will be deeper for
> bigger data.  Thus you get super-linear scaling which is my definition of
> not very scalable.  Hopefully the overage is log N or less so that you can
> get away with it.


Yes a # of passes over the data equal to the depth of the trees is what I
had in mind. I thought approaches dismissed earlier in this thread were
contemplating something that sent around much more data than that. Good
point, that is super-linear. Lightly maybe; the depth of the tree is still
softly bounded by the minimum leaf size or hard-bounded by a max depth.

I am still not 100% clear how you would avoid evaluating the data this many
times... and how you do that without reading or transferring it around...
but I haven't thought about it for longer than the minutes writing this
message.

Re: Out-of-core random forest implementation

Reply via email to