On Fri, Mar 8, 2013 at 6:03 PM, Sean Owen <[email protected]> wrote:

> Oh, certainly. I was thinking in the realm of distributed systems only.
> Surely serialization across a network is a necessary step in anything like
> that. Serializing to local disk first, or a distributed file system, may
> not be.
Right on both. Serialization isn't much of the issue. It is the disk and
the hard checkpointing.

> The local writes may not matter. But wouldn't YARN-type setups
> still be writing to distributed storage?

Not necessarily. Mesos and YARN allow you to run programs across a
cluster. Both have the notion of a distinguished node that requests the
others. What these nodes actually do is up to the implementor. You could
checkpoint or not.

> My broad hunch is that communicating the same amount of data faster
> probably doesn't get an order of magnitude faster, but a different paradigm
> that lets you transmit less data does.

Well, with k-means, Hadoop map-reduce implementations are at least an
order of magnitude slower than, say, a Shark or GraphLab implementation.
The Shark implementation is pretty much one-for-one and the GraphLab
version allows asynchronous updates to the centroids.

> I was musing about whether M/R
> forced you into a hopelessly huge amount of I/O implementation for RF

If your data fits in cluster memory and you aren't running a caching
implementation, it definitely increases disk I/O. If your data doesn't
fit in memory, you get a not-very-scalable implementation: you have to
pass over your data a number of times roughly proportional to the depth
of your tree, and your tree will be deeper for bigger data. Thus you get
super-linear scaling, which is my definition of not very scalable.
Hopefully the overage is log N or less so that you can get away with it.

> These days I continue to want a better sense not just of whether entire
> paradigms are more/less suitable and how and when and why, but when two
> different concepts in the same paradigm are qualitatively different or just
> a different point on a tradeoff curve, optimizing for a different type of
> problem.

Yes. Yes.

For instance, stochastic SVD and streaming k-means both radically change
the game when it comes to map-reduce. But the real issue has to do with
whether scaling is truly linear or not.
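To make the iteration pattern concrete, here is a minimal sketch of plain
Lloyd's k-means in Python (illustrative only, not the actual Shark, GraphLab,
or Mahout code). The point is the loop structure: on Hadoop map-reduce each
pass over `points` is a full job that re-reads the input from disk and writes
results back, while a caching system keeps `points` in memory across
iterations, which is where the order-of-magnitude gap comes from.

```python
import random

def kmeans(points, k, iters=10):
    """Plain Lloyd's iteration. Each loop trip makes one full pass over
    the data: in map-reduce that pass re-reads the input from HDFS; in a
    caching system the data stays resident across all `iters` passes."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign step (the "map"): nearest centroid for each point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step (the "reduce"): mean of each non-empty cluster.
        centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Nothing in the per-iteration math forces the disk round trip; it is purely an
artifact of the execution model.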
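The super-linear scaling claim for trees can be sketched with a toy cost
model (an assumption for illustration, not Mahout's actual decision-forest
code): level-by-level tree building does one full pass per tree level, and a
roughly balanced tree over N rows has about log2(N / rows_per_leaf) levels,
so total I/O grows like N log N rather than N.

```python
import math

def mr_passes(n_rows, rows_per_leaf=100):
    """Toy cost model: one full data pass per tree level. A balanced
    tree over n_rows examples with ~rows_per_leaf examples per leaf has
    about log2(n_rows / rows_per_leaf) levels, so total rows read is
    n_rows * depth -- super-linear in the data size."""
    depth = max(1, math.ceil(math.log2(n_rows / rows_per_leaf)))
    return depth, n_rows * depth  # (passes, total rows read)

for n in (10**6, 10**8):
    depth, total_io = mr_passes(n)
    print(n, depth, total_io)
```

Under this model, 100x more data needs roughly 143x the I/O (20 passes
instead of 14), which is the log N overage mentioned above.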
> On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <[email protected]> wrote:
>
> > The big cost in map-reduce iteration isn't just startup. It is that the
> > input has to be read from disk and the output written to same. Were it to
> > stay in memory, things would be vastly faster.
> >
> > Also, startup costs are still pretty significant. Even on MapR, one of the
> > major problems in setting the recent minute-sort record was getting things
> > to start quickly. Just setting the heartbeat faster doesn't work on
> > ordinary Hadoop because there is a global lock that begins to starve the
> > system. We (our guys, not me) had to seriously hack the job tracker to
> > move things out of that critical section. At that point, we were able to
> > shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> > heartbeats per second). The startup and cleanup tasks are also single
> > threaded.
> >
> > It might be plausible to shorten to this degree and further on a small
> > cluster. But iteration is still very painful.
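As a quick sanity check of the heartbeat arithmetic quoted above: with each
of 2000 nodes heartbeating every 800 ms, the job tracker sees

```python
nodes = 2000
interval_s = 0.8  # 800 ms heartbeat interval per node

rate = nodes / interval_s
print(rate)  # 2500.0 heartbeats/sec, consistent with the ">2400" figure above
```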
