Are there statistics available to monitor what percentage of the pairs remains in memory, and what percentage was written to disk? Or what are these exceptional cases that you mention?
Hadoop goes to some lengths to make sure that things can stay in memory as much as possible. There are still cases, however, where intermediate results are normally written to disk. That means that implementors will have those time scales in their heads as they do things, which will inevitably make the trade-offs somewhat poor compared to a system that never envisions intermediate data being written to disk.
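To make the in-memory-versus-disk trade-off concrete: the map-side sort buffer and its spill threshold are among the knobs that control when intermediate map output gets spilled to disk. A minimal illustrative sketch, assuming the `io.sort.mb` and `io.sort.spill.percent` parameter names from Hadoop's map-side sort configuration (whether they apply, and their defaults, depends on your Hadoop version; the values below are hypothetical):

```xml
<!-- hadoop-site.xml fragment (illustrative values; tune to your task heap size) -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <!-- size in MB of the in-memory buffer used to sort map output
       before it is spilled to disk -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.90</value>
  <!-- fraction of that buffer that may fill before a background
       spill to disk is started -->
</property>
```

On the statistics question: if your version exposes it, the per-job "Spilled Records" counter (shown in the JobTracker web UI alongside the other job counters) is the closest thing to a direct measure of how much intermediate data went to disk.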
But other than guessing like this, I couldn't actually say how it would turn out, except that for very short jobs, moving jar files around and other startup costs can be the dominant cost.
On Sun, Jun 1, 2008 at 5:05 AM, Martin Jaggi <[EMAIL PROTECTED]> wrote:
So in the case that all intermediate pairs fit into the RAM of the cluster, does the InMemoryFileSystem already allow the intermediate phase to be done without much disk access? Or what, in your opinion, would be the current bottleneck in Hadoop in this scenario (huge computational load, not so much data in/out)?