Look at how Nutch distributes search.

Each search engine has an independent index that contains a fraction of the
documents.  Each one executes every query and the results are merged.  Each
engine has an index of 5-10 million documents, so a web-scale index takes
100-1000 machines.

This is described in the Lucene book.
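
To make the pattern concrete, here is a minimal sketch of the fan-out and
merge step, assuming a recent Lucene API; the shard directories, class name
and topN parameter are made up for illustration:

  // Run the same query against every shard index and merge the hits by score.
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;
  import java.nio.file.Paths;
  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  public class ShardedSearch {

    static List<ScoreDoc> searchAllShards(String[] shardDirs, Query query, int topN)
        throws Exception {
      List<ScoreDoc> merged = new ArrayList<>();
      for (String dir : shardDirs) {
        // In Nutch each shard lives on its own search server; local
        // directories are used here purely for illustration.
        try (FSDirectory shard = FSDirectory.open(Paths.get(dir));
             DirectoryReader reader = DirectoryReader.open(shard)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          TopDocs hits = searcher.search(query, topN);
          for (ScoreDoc hit : hits.scoreDocs) {
            // Doc ids are shard-local; a real front end would also record
            // which shard each hit came from.
            merged.add(hit);
          }
        }
      }
      // Merge step: keep the globally best topN hits by score.
      merged.sort(Comparator.comparingDouble((ScoreDoc sd) -> sd.score).reversed());
      return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }
  }

In a real deployment each shard search runs on its own machine and only the
per-shard top hits travel over the wire; the front end does the merge.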


On 8/23/07 8:56 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

> Thanks so much, it helps me a lot. I'm actually quite lost with Hadoop's
> mechanisms.
> The point of my study is to distribute the Lucene searching phase with
> Hadoop...
> From what I've understood, a way to distribute the search over a big
> Lucene index would be to put this index on HDFS and to implement
> the Lucene search job under the Mapper interface, am I right?
> But I'm stuck because of Lucene's search architecture... the
> IndexReader takes the whole path where the index is located as an argument,
> so I don't see how to distribute it...
> Well, I guess this issue is quite different from the original subject of
> this thread, maybe I should post a new message about it.
> 
> 
> Arun C Murthy wrote:
>> Samuel,
>> 
>> Samuel LEMOINE wrote:
>>> Well, I don't get it... when you pass arguments to a map job, you
>>> just give a key and a value, so how can Hadoop make the link between
>>> those arguments and the data concerned? Really, your answer doesn't
>>> help me at all, sorry ^^
>>> 
>> 
>> The input of a map-reduce job is a file or a bunch of files. These
>> files are usually stored on HDFS, which splits a logical file into
>> physical blocks of fixed size (configurable, with a default of
>> 128MB). Each block is replicated for reliability.
>> 
>> The important point to note is that both the HDFS and Map-Reduce
>> clusters run on the same hardware, i.e. they form a combined data and
>> compute cluster.
>> 
>> Now when you launch a job (with lots of maps and reduces), the input
>> file-sets are split into FileSplits (logical splits; the user can control
>> the splitting). The framework then schedules as many maps as there are
>> splits, i.e. there is a one-to-one correspondence between maps and
>> splits and each map processes one input split.
>> 
>> The key idea is to try and *schedule* each map on the _datanode_ (i.e.
>> one among the set of datanodes) which contains the actual block for
>> the logical input-split that the map is supposed to process. This is
>> what we refer to as 'data-locality'. Hence we move the computation (the
>> actual map) to the data (input split).
>> 
>> This is feasible because:
>> a) HDFS & Map-Reduce share the same physical cluster.
>> b) HDFS exposes (via the relevant APIs) the underlying block locations
>> where a file is physically stored on the file-system.
>> 
>> hth,
>> Arun
>> 
>> 
>>> Devaraj Das wrote:
>>> 
>>>> That's the paradigm of Hadoop's Map-Reduce.
>>>>  
>>>> 
>>>>> -----Original Message-----
>>>>> From: Samuel LEMOINE [mailto:[EMAIL PROTECTED]
>>>>> Sent: Thursday, August 23, 2007 2:48 PM
>>>>> To: [email protected]
>>>>> Subject: "Moving Computation is Cheaper than Moving Data"
>>>>> 
>>>>> When I read the Hadoop documentation:
>>>>> The Hadoop Distributed File System: Architecture and Design
>>>>> (http://lucene.apache.org/hadoop/hdfs_design.html)
>>>>> 
>>>>> a paragraph held my attention:
>>>>> 
>>>>> 
>>>>>       "Moving Computation is Cheaper than Moving Data"
>>>>> 
>>>>> A computation requested by an application is much more efficient if
>>>>> it is executed near the data it operates on. This is especially
>>>>> true when the size of the data set is huge. This minimizes network
>>>>> congestion and increases the overall throughput of the system. The
>>>>> assumption is that it is often better to migrate the computation
>>>>> closer to where the data is located rather than moving the data to
>>>>> where the application is running. HDFS provides interfaces for
>>>>> applications to move themselves closer to where the data is located.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I'd like to know how to perform that, especially with the aim of
>>>>> distributing Lucene search. Which Hadoop classes should I use to do
>>>>> that?
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> Samuel
>>>>> 
>>>>>     
>>>> 
>>>> 
>>>> 
>>>>   
>>> 
>>> 
>> 
>> 
> 
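
As a footnote to point (b) in Arun's reply quoted above, here is a minimal
sketch of asking HDFS which datanodes hold the blocks of a file, assuming
the current Hadoop FileSystem API; the path is made up. This block-to-host
mapping is what the framework consults when it schedules a map next to its
input split:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocations {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // Illustrative path; any file stored on HDFS will do.
      Path file = new Path("/user/samuel/data/part-00000");
      FileStatus status = fs.getFileStatus(file);

      // Ask the namenode which hosts hold each block of the file.
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset " + block.getOffset()
            + " length " + block.getLength()
            + " hosts " + String.join(",", block.getHosts()));
      }
    }
  }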
