Thanks for your comments!
So if all intermediate pairs fit into the cluster's RAM, does the
InMemoryFileSystem already allow the intermediate phase to run without
much disk access? If not, what do you think is the current bottleneck
in Hadoop in this scenario (heavy computational load, little data in
or out)?
On 01.06.2008 at 10:08, Ted Dunning wrote:
Hadoop is highly optimized towards handling datasets that are much too
large to fit into memory. That means that there are many trade-offs
that have been made that make it much less useful for very short jobs
or jobs that would fit into memory easily.
Multi-core implementations of map-reduce are very interesting for a
number of applications as are in-memory implementations for
distributed architectures. I don't think that anybody really knows yet
how well these other implementations will play with Hadoop. The
regimes that they are designed to optimize are very different in terms
of data scale, number of machines and networking speed. All of these
constraints drive the design in innumerable ways.
On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <[EMAIL PROTECTED]> wrote:
Concerning real-time Map Reduce within (and not only between) machines
(multi-core & GPU), e.g. the Phoenix and Mars frameworks:
I'm really interested in very fast Map Reduce tasks, i.e. without much
disk access. With the rise of multi-core systems, this could get more
and more interesting, and could maybe even lead to something like
'super-computing for everyone', or is that a bit overwhelming? Anyway,
I was pleasantly surprised to see the recent Phoenix
(http://csl.stanford.edu/~christos/sw/phoenix/) implementation of Map
Reduce for multi-core CPUs (they won the best paper award at HPCA'07).
GPU computing has also been in the news again recently, pushed by
Nvidia (see CUDA: http://www.nvidia.com/object/cuda_showcase.html ),
and a Map Reduce implementation for GPUs called Mars has now become
available:
http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
The Mars people say at the end of their paper: "We are also interested
in integrating Mars into the existing Map Reduce implementations such
as Hadoop so that the Map Reduce framework can take the advantage of
the parallelism among different machines as well as the parallelism
within each machine."
What do you think of this, especially about the multi-core approach?
Do you think these needs are already served by the current
InMemoryFileSystem of Hadoop or not? Are there any plans to
'integrate' one of the two frameworks above?
Or would reducing the significant overhead of the intermediate data
pairs (https://issues.apache.org/jira/browse/HADOOP-3366 ) already be
enough?
Any comments?
--
ted