On Fri, Mar 8, 2013 at 11:54 PM, Ted Dunning <[email protected]> wrote:
> Right on both. Serializing isn't much of the issue. It is the disk and > the hard checkpointing. > > Well, with k-means, Hadoop map-reduce implementations are at least an order > of magnitude slower than, say, a Shark or Graphlab implementation. The > Shark implementation is pretty much one-for-one and the graphlab version > allows asynchronous updates to centroids. > The pattern here seems to be 'checkpointing' as the culprit -- that's well said. In an iterative M/R process you are necessarily writing a lot of state to HDFS just to read it back again. Whereas the alternatives mentioned here are based on long-lived workers holding on to data over long periods. Some checkpointing is necessary to be robust to failure over hours of work, but I assume (?) these accomplish this in a lighter-weight way. Hmm, my long-standing assumption/experience had been that the I/O wasn't a big part of the run-time. But I'm working on a particular set of tasks that jumps through hoops to avoid I/O. So if it runs for 10 minutes to write out 100MB of data, no big deal. At smaller scales, for different distributions, and for different algorithms -- not necessarily true. My disposition has been to take M/R as a given for now, because in practice it's so widely available, and figure out what works well within those constraints. I feel more compelled than ever to optimize away I/O, but it seems to work just fine for certain approaches, done with care, even when iterative. But what do you think of my distinction above? personally that would be a bright line that I'm looking for to conclude that big-learning-style problems ought to be moving at last in 2013 to a different paradigm. I hadn't quite had that before for BSP or graph-oriented or other paradigms. > If your data fits in cluster memory and you aren't running a caching > implementation, it definitely increases disk I/O. > I was going to say that fitting in cluster memory seems like a non-trivial condition -- but for a classifier problem, and for even quite large data sets, probably not. I'm interested in avoiding this condition though, if the price is only "moderate". > > If your data doesn't fit in memory you get a kinda not scalable > implementation. You have to pass over your data a number of times roughly > proportional to the depth of your tree. Your tree will be deeper for > bigger data. Thus you get super-linear scaling which is my definition of > not very scalable. Hopefully the overage is log N or less so that you can > get away with it. Yes a # of passes over the data equal to the depth of the trees is what I had in mind. I thought approaches dismissed earlier in this thread were contemplating something that sent around much more data than that. Good point, that is super-linear. Lightly maybe; the depth of the tree is still softly bounded by the minimum leaf size or hard-bounded by a max depth. I am still not 100% clear how you would avoid evaluating the data this many times... and how you do that without reading or transferring it around... but I haven't thought about it for longer than the minutes writing this message.
