On Fri, Mar 8, 2013 at 6:03 PM, Sean Owen <[email protected]> wrote:

> Oh, certainly. I was thinking in the realm of distributed systems only.
> Surely serialization across a network is a necessary step in anything like
> that. Serializing to local disk first, or a distributed file system, may
> not be.
Right on both. Serialization isn't much of the issue. It is the disk and
the hard checkpointing.

> The local writes may not matter. But wouldn't YARN-type setups
> still be writing to distributed storage?

Not necessarily. Mesos and YARN allow you to run programs across a
cluster. Both have the notion of a distinguished node that requests the
others. What these nodes actually do is up to the implementor. You could
checkpoint or not.

> My broad hunch is that communicating the same amount of data faster
> probably doesn't get an order of magnitude faster, but a different paradigm
> that lets you transmit less data does.

Well, with k-means, Hadoop map-reduce implementations are at least an
order of magnitude slower than, say, a Shark or GraphLab implementation.
The Shark implementation is pretty much one-for-one and the GraphLab
version allows asynchronous updates to the centroids.

> I was musing about whether M/R
> forced you into a hopelessly huge amount of I/O implementation for RF

If your data fits in cluster memory and you aren't running a caching
implementation, it definitely increases disk I/O. If your data doesn't
fit in memory, you get a not-very-scalable implementation: you have to
pass over your data a number of times roughly proportional to the depth
of your tree, and your tree will be deeper for bigger data. Thus you get
super-linear scaling, which is my definition of not very scalable.
Hopefully the overage is log N or less so that you can get away with it.

> These days I continue to want a better sense not just of whether entire
> paradigms are more/less suitable and how and when and why, but when two
> different concepts in the same paradigm are qualitatively different or just
> a different point on a tradeoff curve, optimizing for a different type of
> problem.

Yes. Yes.

For instance, stochastic SVD and streaming k-means both radically change
the game when it comes to map-reduce. But the real issue has to do with
whether scaling is truly linear or not.
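To make the iteration pattern concrete, here is a minimal sketch of plain
Lloyd's k-means in Python (illustrative only, not the actual Shark, GraphLab,
or Mahout code). The point is the loop structure: on Hadoop map-reduce each
pass over `points` is a full job that re-reads the input from disk and writes
results back, while a caching system keeps `points` in memory across
iterations, which is where the order-of-magnitude gap comes from.

```python
import random

def kmeans(points, k, iters=10):
    """Plain Lloyd's iteration. Each loop trip makes one full pass over
    the data: in map-reduce that pass re-reads the input from HDFS; in a
    caching system the data stays resident across all `iters` passes."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign step (the "map"): nearest centroid for each point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step (the "reduce"): mean of each non-empty cluster.
        centroids = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

Nothing in the per-iteration math forces the disk round trip; it is purely an
artifact of the execution model.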
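The super-linear scaling claim for trees can be sketched with a toy cost
model (an assumption for illustration, not Mahout's actual decision-forest
code): level-by-level tree building does one full pass per tree level, and a
roughly balanced tree over N rows has about log2(N / rows_per_leaf) levels,
so total I/O grows like N log N rather than N.

```python
import math

def mr_passes(n_rows, rows_per_leaf=100):
    """Toy cost model: one full data pass per tree level. A balanced
    tree over n_rows examples with ~rows_per_leaf examples per leaf has
    about log2(n_rows / rows_per_leaf) levels, so total rows read is
    n_rows * depth -- super-linear in the data size."""
    depth = max(1, math.ceil(math.log2(n_rows / rows_per_leaf)))
    return depth, n_rows * depth  # (passes, total rows read)

for n in (10**6, 10**8):
    depth, total_io = mr_passes(n)
    print(n, depth, total_io)
```

Under this model, 100x more data needs roughly 143x the I/O (20 passes
instead of 14), which is the log N overage mentioned above.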
> On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <[email protected]> wrote:
>
> > The big cost in map-reduce iteration isn't just startup. It is that the
> > input has to be read from disk and the output written to same. Were it to
> > stay in memory, things would be vastly faster.
> >
> > Also, startup costs are still pretty significant. Even on MapR, one of the
> > major problems in setting the recent minute-sort record was getting things
> > to start quickly. Just setting the heartbeat faster doesn't work on
> > ordinary Hadoop because there is a global lock that begins to starve the
> > system. We (our guys, not me) had to seriously hack the job tracker to
> > move things out of that critical section. At that point, we were able to
> > shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> > heartbeats per second). The startup and cleanup tasks are also single
> > threaded.
> >
> > It might be plausible to shorten to this degree and further on a small
> > cluster. But iteration is still very painful.
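As a quick sanity check of the heartbeat arithmetic quoted above: with each
of 2000 nodes heartbeating every 800 ms, the job tracker sees

```python
nodes = 2000
interval_s = 0.8  # 800 ms heartbeat interval per node

rate = nodes / interval_s
print(rate)  # 2500.0 heartbeats/sec, consistent with the ">2400" figure above
```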
