Hi,
You should not consider implementing distributed search over map/reduce.
The paradigm is not well suited to index searching, and Lucene is already
quite efficient for millions of documents on a single node. As suggested,
you can have a look at Nutch's distributed search architecture. It is far
from perfect, but it is a good starting point.
Samuel LEMOINE wrote:
Thanks so much, it helps me a lot. I'm actually quite lost with
Hadoop's mechanisms.
The point of my study is to distribute the Lucene searching phase with
Hadoop...
From what I've understood, a way to distribute the search over a big
Lucene index would be to put this index on HDFS and to implement the
Lucene search job under the Mapper interface, am I right?
But I'm stuck because of Lucene's searchable architecture... the
IndexReader takes the whole path where the index is located as an
argument, so I don't see how to distribute it...
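For what it's worth, here is a very rough, hypothetical sketch of the kind of map-side search being described. It assumes the index has already been split into shards, that each input line handed to a map is the path of one shard on the node's *local* disk (stock Lucene has no Directory implementation that reads from HDFS, so shards would have to be copied or distributed to local filesystems first), that documents carry "id" and "content" fields, and that the query string travels in the job configuration under a made-up property name ("search.query"). The Lucene calls are the pre-3.0 API and the Hadoop classes are the old org.apache.hadoop.mapred API as it looks in later 0.x/1.x releases; this is an illustration of where the pieces would go, not an established recipe.

  import java.io.IOException;

  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  // Hypothetical mapper: each input value is the *local* path of one index shard.
  public class ShardSearchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, FloatWritable> {

    private String queryString;

    public void configure(JobConf job) {
      // "search.query" is a made-up job property carrying the user's query string.
      queryString = job.get("search.query");
    }

    public void map(LongWritable key, Text shardPath,
                    OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      IndexSearcher searcher = new IndexSearcher(shardPath.toString());
      try {
        // "content" and "id" are assumed field names in the shard's documents.
        Query query = new QueryParser("content", new StandardAnalyzer())
            .parse(queryString);
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length() && i < 10; i++) {
          // Emit (document id, score); a single reduce can merge the per-shard hits.
          output.collect(new Text(hits.doc(i).get("id")),
                         new FloatWritable(hits.score(i)));
        }
      } catch (ParseException e) {
        throw new IOException("Could not parse query: " + queryString);
      } finally {
        searcher.close();
      }
    }
  }

A single reduce could then merge and rank the per-shard hits, but the per-job startup overhead is measured in seconds, which is why the advice at the top of this thread points to Nutch's dedicated search servers for interactive queries.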
Well, I guess this issue is quite different from the original subject of
this thread; maybe I should post a new message about it.
Arun C Murthy wrote:
Samuel,
Samuel LEMOINE wrote:
Well, I don't get it... when you pass arguments to a map job, you
just give a key and a value, so how can Hadoop make the link between
those arguments and the data concerned? Really, your answer doesn't
help me at all, sorry ^^
The input to a map-reduce job is a file or a set of files. These
files are usually stored on HDFS, which splits a logical file into
physical blocks of fixed size (configurable, with a default of
128MB). Each block is replicated for reliability.
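To make that concrete, here is a minimal sketch (the path comes from the command line, nothing else is assumed) showing how a client can ask HDFS for a stored file's length, block size, and replication factor:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockInfo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up the cluster's *-site.xml settings
      FileSystem fs = FileSystem.get(conf);       // the configured default (HDFS) filesystem
      FileStatus status = fs.getFileStatus(new Path(args[0]));
      System.out.println("length      = " + status.getLen());
      System.out.println("block size  = " + status.getBlockSize());
      System.out.println("replication = " + status.getReplication());
    }
  }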
The important point to note is that the HDFS and Map-Reduce
clusters run on the same hardware, i.e. a combined data and compute
cluster.
Now, when you launch a job (with lots of maps and reduces), the input
file-sets are split into FileSplits (logical splits; the user can control
the splitting). The framework then schedules as many maps as there are
splits, i.e. there is a one-to-one correspondence between maps and
splits, and each map processes one input split.
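As a rough illustration of that wiring, a driver for the old mapred API might look like the sketch below. The class names are placeholders (ShardSearchMapper is the hypothetical mapper sketched earlier in the thread), the InputFormat computes the splits from the input file-set, and the framework launches one map per split:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class SearchJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SearchJob.class);
      conf.setJobName("lucene-shard-search");            // hypothetical job
      conf.set("search.query", args[2]);                 // made-up property read by the mapper

      conf.setInputFormat(TextInputFormat.class);        // one record per line of the input file
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      conf.setMapperClass(ShardSearchMapper.class);      // hypothetical mapper sketched earlier
      conf.setNumReduceTasks(1);                         // one reduce to merge per-shard output
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(FloatWritable.class);

      // No setNumMapTasks() call: the framework derives the number of maps
      // from the splits that the InputFormat computes over the input file-set.
      JobClient.runJob(conf);
    }
  }

Note that JobConf.setNumMapTasks() is only a hint; the actual number of maps always comes from the splits.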
The key idea is to try and *schedule* each map on a _datanode_
(i.e. one among the set of datanodes) which holds the actual block
for the logical input-split that the map is supposed to process. This
is what we refer to as 'data-locality'. Hence we move the computation
(the actual map) to the data (the input split).
This is feasible due to:
a) HDFS & Map-Reduce share the same physical cluster.
b) HDFS exposes (via the relevant APIs) the underlying block-locations
where a file is physically stored on the file-system.
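For (b), in reasonably recent Hadoop releases the call in question is FileSystem.getFileBlockLocations() (the method name has changed across versions). A minimal sketch:

  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus file = fs.getFileStatus(new Path(args[0]));
      // Ask the namenode which hosts hold each block of the file.
      BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset " + block.getOffset()
            + ", length " + block.getLength()
            + ", hosts " + Arrays.toString(block.getHosts()));
      }
    }
  }

This is the same block-location information the framework consults when deciding where to run each map.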
hth,
Arun
Devaraj Das wrote:
Essentially what Hadoop's map-reduce tries to do is to schedule *maps* on
the nodes that hold the data those maps will process. That's the paradigm
of Hadoop's Map-Reduce.
-----Original Message-----
From: Samuel LEMOINE [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 23, 2007 2:48 PM
To: [email protected]
Subject: "Moving Computation is Cheaper than Moving Data"
When I read the Hadoop documentation:
The Hadoop Distributed File System: Architecture and Design
(http://lucene.apache.org/hadoop/hdfs_design.html)
a paragraph held my attention:
"Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient
if it is executed near the data it operates on. This is especially
true when the size of the data set is huge. This minimizes network
congestion and increases the overall throughput of the system. The
assumption is that it is often better to migrate the computation
closer to where the data is located rather than moving the data to
where the application is running. HDFS provides interfaces for
applications to move themselves closer to where the data is located.
I'd like to know how to perform that, especially with the aim of
distributed Lucene search? Which Hadoop classes should I use to
do that?
Thanks in advance,
Samuel