I generally agree that Hadoop is quite a bit of overhead, but I was shocked
(and stunned) when I re-implemented a recommendation engine using Hadoop a
few years back.  The reference implementation was a sparse matrix in-memory
model that was slightly tuned for small space.  The sparse matrix was
similar to what we have now, but all of the integers were byte-wise
compressed.

The serious knock-me-back-on-my-heels moment came when I measured a quick-and-dirty
map-reduce version running on the same data.  Without any consideration of
memory use or optimization, the local invocation of the map-reduce version
actually ran faster than the in-memory version.

What had happened was that my memory-efficiency zealotry, combined with the
random-access pattern of my program, was making every access expensive and was
blowing out the L2 cache completely.  That resulted in my program running
hundreds of times slower than it might have with good cache locality and
without the compressed-integer overhead.
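
To make the locality point concrete, here is a tiny stand-alone demo (not my
original code, just an illustration) that sums the same array once in
sequential order and once in a shuffled order that defeats the cache:

// Illustrative only: same arithmetic, two visit orders.  The array is much
// larger than L2, so the shuffled pass pays for a cache miss on most reads.
import java.util.Random;

public class LocalityDemo {
  public static void main(String[] args) {
    int n = 1 << 24;                       // ~16M ints, far bigger than L2
    int[] data = new int[n];
    int[] order = new int[n];
    Random rnd = new Random(42);
    for (int i = 0; i < n; i++) {
      data[i] = i;
      order[i] = i;
    }
    // Fisher-Yates shuffle of the visit order to force random access.
    for (int i = n - 1; i > 0; i--) {
      int j = rnd.nextInt(i + 1);
      int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    long t0 = System.nanoTime();
    long seqSum = 0;
    for (int i = 0; i < n; i++) seqSum += data[i];          // sequential scan
    long t1 = System.nanoTime();
    long rndSum = 0;
    for (int i = 0; i < n; i++) rndSum += data[order[i]];   // cache-hostile scan
    long t2 = System.nanoTime();

    System.out.printf("sequential: %d ms, random: %d ms (sums %d / %d)%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seqSum, rndSum);
  }
}

On most machines the shuffled pass is noticeably slower even though it does
exactly the same work; the compressed integers in my real program only made
the gap worse.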

On the other hand, the Hadoop version was reading data from disk in a
completely sequential fashion and was making use of some very well-written
sort and merge routines.  The result was that it was hitting cached data far
more often than my other program was.
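
For anyone wondering what "quick and dirty" looked like, the job had roughly
the following shape.  This is a from-memory sketch, not the actual code; the
class names and the pairwise co-occurrence counting are just illustrative of
the pattern:

// Sketch of a local co-occurrence counting job.  The mapper streams one
// user's item history per line and emits item pairs; Hadoop's sort/merge
// groups the pairs before a trivial summing reducer.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CooccurrenceJob {

  // Input line: one user's item ids, whitespace separated.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] items = value.toString().trim().split("\\s+");
      for (int i = 0; i < items.length; i++) {
        for (int j = i + 1; j < items.length; j++) {
          context.write(new Text(items[i] + "\t" + items[j]), ONE);
        }
      }
    }
  }

  // The framework's sort/merge delivers all counts for a pair together.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cooccurrence");
    job.setJarByClass(CooccurrenceJob.class);
    job.setMapperClass(PairMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The only interesting code is a few lines in the mapper and reducer; the
framework's sequential I/O and sort/merge do the heavy lifting.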

The net result was that running Hadoop on a single machine as an out-of-core
program was slightly faster than my fancy in-core answer.  Moving to
multi-machine Hadoop incurred a lot more overhead, but I was able to work on
data sets hundreds of times larger with about 10x the hardware in the same
time.

This story can be read many ways.  One reading is that I am a putz for
writing a not-very-clever (or entirely-too-clever) program in the first
place.  Another is that it is one more example of how disk and memory have
come to behave more like tapes than like random-access devices.  A third is
that the discipline of map-reduce is good for writing simple, fast programs.

Regardless, I haven't looked back since that day.  If I have a batch program
of any scale at all, it goes into map-reduce form at my earliest
opportunity.

On Sat, Nov 21, 2009 at 10:44 PM, Sean Owen <[email protected]> wrote:

> think we can already state the answer though: it's going to take
> much more CPU time and resources to run a computation via Hadoop than
> run it completely on one machine (non-parallelized). Hadoop is a lot
> of overhead.
>
> ...
> Are you wondering how much the overhead is, of a framework like Hadoop?
>



-- 
Ted Dunning, CTO
DeepDyve
