Samuel,
Samuel LEMOINE wrote:
Well, I don't get it... when you pass arguments to a map job, you just
give a key and a value, so how can Hadoop make the link between those
arguments and the data concerned? Really, your answer doesn't help me
at all, sorry ^^
The input of a map-reduce job is a file or a bunch of files. These files
are usually stored on HDFS, which splits up a logical file into physical
blocks of fixed size (configurable with default size of 128MB). Each
block is replicated for reliability.
The important point to note is that both the HDFS and Map-Reduce
clusters run on the same hardware i.e. a combined data and compute cluster.
Now, when you launch a job (with lots of maps and reduces), the input
file-sets are split into FileSplits (logical splits; the user can
control the splitting). The framework then schedules as many maps as
there are splits, i.e. there is a one-to-one correspondence between maps
and splits, and each map processes one input split.
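
To make the split-to-map correspondence concrete, here is a rough
sketch in Python. It is not Hadoop code: compute_splits is a
hypothetical helper, and the real FileInputFormat logic has more rules
(minimum split size, record boundaries, and so on). It only shows a
file being cut into logical (offset, length) ranges, with one map per
range.

```python
def compute_splits(file_length, split_size):
    """Cut a file of file_length bytes into (offset, length) ranges.

    Hypothetical helper for illustration only; the last range may be
    shorter than split_size.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024

# Assumed sizes: a 300 MB input file and a 128 MB split size.
splits = compute_splits(300 * MB, 128 * MB)

# One map task per split: map i reads only its own byte range.
for map_id, (offset, length) in enumerate(splits):
    print(f"map {map_id}: offset={offset} length={length}")
```

With these assumed sizes the file yields three splits, so the framework
would run three maps.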
The key idea is to try to *schedule* each map on a _datanode_ (i.e.
one among the set of datanodes) which contains the actual block for the
logical input-split that the map is supposed to process. This is what we
refer to as 'data-locality'. Hence we move the computation (the actual
map) to the data (input split).
This is feasible due to:
a) HDFS & Map-Reduce share the same physical cluster.
b) HDFS exposes (via relevant APIs) the underlying block-locations where
a file is physically stored on the file-system.
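
The scheduling idea can be sketched roughly as follows, again in plain
Python rather than Hadoop code. The function pick_split, the split ids
and the host names are all made up for illustration, and the real
scheduler is more involved. The point is only this: when a node asks
for work, prefer a split whose block is stored on that node, and fall
back to a remote split otherwise.

```python
def pick_split(node, pending):
    """Pick a pending split for node, preferring a data-local one.

    pending maps a split id to the list of hosts holding a replica of
    its block (the kind of information HDFS exposes). The chosen split
    is removed from pending. Returns (split_id, is_local), or None if
    no work is left. Hypothetical helper, not part of Hadoop.
    """
    if not pending:
        return None
    # First choice: a split whose block lives on the requesting node.
    for split_id, hosts in pending.items():
        if node in hosts:
            del pending[split_id]
            return split_id, True
    # Otherwise hand out any split; its data must cross the network.
    split_id = next(iter(pending))
    del pending[split_id]
    return split_id, False


# Assumed block locations: each block is replicated on three hosts.
pending = {
    "split-0": ["node1", "node2", "node3"],
    "split-1": ["node2", "node3", "node4"],
    "split-2": ["node1", "node4", "node5"],
}

# Nodes ask for work in this (made-up) order.
assignments = [(node, pick_split(node, pending))
               for node in ["node4", "node1", "node6"]]
for node, choice in assignments:
    print(node, choice)
```

Here node4 and node1 each get a split they already store, while node6
holds no replica and has to read its split remotely.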
hth,
Arun
Essentially what Hadoop's map-reduce tries to do is to schedule *maps* on
Devaraj Das wrote:
That's the paradigm of Hadoop's Map-Reduce.
-----Original Message-----
From: Samuel LEMOINE [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 23, 2007 2:48 PM
To: [email protected]
Subject: "Moving Computation is Cheaper than Moving Data"
When I read the Hadoop documentation:
The Hadoop Distributed File System: Architecture and Design
(http://lucene.apache.org/hadoop/hdfs_design.html)
a paragraph caught my attention:
"Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if
it is executed near the data it operates on. This is especially true
when the size of the data set is huge. This minimizes network
congestion and increases the overall throughput of the system. The
assumption is that it is often better to migrate the computation
closer to where the data is located rather than moving the data to
where the application is running. HDFS provides interfaces for
applications to move themselves closer to where the data is located.
I'd like to know how to do that, especially with the aim of
distributed Lucene search. Which Hadoop classes should I use to do
that?
Thanks in advance,
Samuel