I generally agree that Hadoop carries quite a bit of overhead, but I was shocked (and stunned) when I re-implemented a recommendation engine using Hadoop a few years back. The reference implementation was an in-memory sparse matrix model, tuned somewhat for a small memory footprint. The sparse matrix was similar to what we have now, but all of the integers were byte-wise compressed.
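
By byte-wise compressed I mean variable-length (varint-style) encoding of the integers. Here is a minimal sketch of the idea in Java; the class and method names are made up for illustration and this is not the code from that project:

import java.io.ByteArrayOutputStream;

/** Minimal sketch of byte-wise (variable-length) integer compression.
 *  Each int is written 7 bits at a time, low bits first; the high bit
 *  of each byte marks "more bytes follow". Names are illustrative only. */
public class VarIntSketch {

  static void writeVarInt(int value, ByteArrayOutputStream out) {
    while ((value & ~0x7F) != 0) {        // more than 7 bits remain
      out.write((value & 0x7F) | 0x80);   // emit low 7 bits, set continuation bit
      value >>>= 7;
    }
    out.write(value);                     // final byte, continuation bit clear
  }

  static int readVarInt(byte[] buf, int[] pos) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = buf[pos[0]++];
      result |= (b & 0x7F) << shift;      // accumulate 7 bits per byte
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int[] indexes = {3, 17, 300, 70000};  // e.g. column indexes of a sparse row
    for (int i : indexes) {
      writeVarInt(i, out);
    }
    byte[] packed = out.toByteArray();    // 7 bytes here instead of 16
    int[] pos = {0};
    for (int i = 0; i < indexes.length; i++) {
      System.out.println(readVarInt(packed, pos));
    }
  }
}

The space savings are real, but every access now pays for a decode loop on top of the pointer chasing, which figures into what happened next.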
The serious knock-me-back-on-my-heels moment came when I measured a quick and dirty map-reduce version running on the same data. Without any attention to memory use or optimization, the local invocation of the map-reduce version actually ran faster than the in-memory version.

What had happened was that my memory-efficiency zealotry, combined with the random-access nature of my program, was making accesses expensive and blowing the L2 cache completely. That left my program running hundreds of times slower than it might have with good cache coherency and without the compressed-integer overheads. The Hadoop version, on the other hand, was reading data from disk in a completely sequential fashion and making use of some very well written sort and merge routines. As a result it was hitting cached data far more often than my program was. The net result was that running an out-of-core program under Hadoop on a single machine was slightly faster than my fancy in-core answer. Moving to multi-machine Hadoop incurred a lot more overhead, but it let me work on data sets hundreds of times larger with about 10x the hardware in the same time.

This story can be read many ways. One way is to read it as saying what a putz I am for writing a not very clever (or entirely too clever) program in the first place. Another is as one more example of how disk and memory have become more like tapes than like random-access devices. Yet another is that the discipline of map-reduce is good for writing simple, fast programs.

Regardless, I haven't looked back since that day. If I have a batch program of any scale at all, it goes into map-reduce form at my earliest opportunity.

On Sat, Nov 21, 2009 at 10:44 PM, Sean Owen <[email protected]> wrote:

> think we can already state the answer though: it's going to take
> much more CPU time and resources to run a computation via Hadoop than
> run it completely on one machine (non-parallelized). Hadoop is a lot
> of overhead.
>
> ...
> Are you wondering how much the overhead is, of a framework like Hadoop?

--
Ted Dunning, CTO
DeepDyve
