Look at how Nutch distributes search. Each search server has an independent index that contains a fraction of the documents. Every server runs every search, and the results are merged. Each server's index holds 5-10 million documents, so a web-scale index takes 100-1000 machines. This is described in the Lucene book.
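Concretely, the per-query flow looks something like the sketch below. This is plain Java with a hypothetical Shard interface and ScoredDoc class standing in for the per-server searchers; it illustrates the scatter-gather pattern, not Nutch's actual code.

// Scatter-gather search across index shards: fan the query out to every
// shard in parallel, then merge the per-shard results by score.
// "Shard" and "ScoredDoc" are hypothetical stand-ins, not Nutch classes.
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

interface Shard {
    // One per search server; each holds a fraction of the documents.
    List<ScoredDoc> search(String query, int topN) throws Exception;
}

class ScoredDoc {
    final String docId;
    final float score;
    ScoredDoc(String docId, float score) { this.docId = docId; this.score = score; }
}

class ScatterGatherSearcher {
    private final List<Shard> shards;
    private final ExecutorService pool;

    ScatterGatherSearcher(List<Shard> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
    }

    List<ScoredDoc> search(String query, int topN) throws Exception {
        // Every shard runs the same query concurrently.
        List<Future<List<ScoredDoc>>> futures = new ArrayList<Future<List<ScoredDoc>>>();
        for (final Shard shard : shards) {
            futures.add(pool.submit(() -> shard.search(query, topN)));
        }
        // A min-heap keeps only the global top-N as results stream in.
        PriorityQueue<ScoredDoc> best = new PriorityQueue<ScoredDoc>(
                (a, b) -> Float.compare(a.score, b.score));
        for (Future<List<ScoredDoc>> f : futures) {
            for (ScoredDoc d : f.get()) {
                best.add(d);
                if (best.size() > topN) best.poll();
            }
        }
        List<ScoredDoc> merged = new ArrayList<ScoredDoc>(best);
        merged.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return merged;
    }
}

One caveat of this kind of merge: raw scores from independent indexes are not strictly comparable, since each index computes term statistics (e.g. IDF) from only its own fraction of the collection.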
On 8/23/07 8:56 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

> Thanks so much, it helps me a lot. I'm actually quite lost with Hadoop's
> mechanisms.
> The point of my study is to distribute the Lucene searching phase with
> Hadoop...
> According to what I've understood, a way to distribute the search over a
> big Lucene index would be to put this index on HDFS and to implement the
> Lucene search job under the Mapper interface, am I right?
> But I'm stuck because of Lucene's searchable architecture... the
> IndexReader takes as argument the whole path where the index is located,
> so I don't see how to distribute it...
> Well, I guess this issue is quite different from the original subject of
> this thread; maybe I should post a new message about it.
>
> Arun C Murthy wrote:
>> Samuel,
>>
>> Samuel LEMOINE wrote:
>>> Well, I don't get it... when you pass arguments to a map job, you
>>> just give a key and a value, so how can Hadoop make the link between
>>> those arguments and the data concerned? Really, your answer doesn't
>>> help me at all, sorry ^^
>>
>> The input of a map-reduce job is a file or a bunch of files. These
>> files are usually stored on HDFS, which splits up a logical file into
>> physical blocks of fixed size (configurable, with a default size of
>> 128MB). Each block is replicated for reliability.
>>
>> The important point to note is that the HDFS and Map-Reduce clusters
>> run on the same hardware, i.e. a combined data and compute cluster.
>>
>> Now when you launch a job (with lots of maps and reduces), the input
>> file-sets are split into FileSplits (logical splits; the user can
>> control the splitting). The framework then schedules as many maps as
>> there are splits, i.e. there is a one-to-one correspondence between
>> maps and splits, and each map processes one input split.
>>
>> The key idea is to try to *schedule* each map on the _datanode_ (i.e.
>> one among the set of datanodes) which contains the actual block for
>> the logical input split that the map is supposed to process. This is
>> what we refer to as 'data-locality'. Hence we move the computation
>> (the actual map) to the data (the input split).
>>
>> This is feasible because:
>> a) HDFS & Map-Reduce share the same physical cluster.
>> b) HDFS exposes (via the relevant APIs) the underlying block locations
>> where a file is physically stored on the file-system.
>>
>> hth,
>> Arun
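Arun's point (b) can be seen directly from the client API. The small sketch below prints, for each block of a file, the datanodes holding its replicas; this is the information the framework consults when it schedules a map next to its split. Method names follow current Hadoop releases and may differ from the 2007-era API.

// A small sketch of point (b): asking HDFS where a file's blocks
// physically live.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block; getHosts() names the datanodes
        // holding replicas. The scheduler uses this to try to place a
        // map on a node that already holds the block backing its split.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset()
                    + " length " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
    }
}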
>>> Devaraj Das wrote:
>>>
>>>> That's the paradigm of Hadoop's Map-Reduce.
>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel LEMOINE [mailto:[EMAIL PROTECTED]
>>>>> Sent: Thursday, August 23, 2007 2:48 PM
>>>>> To: [email protected]
>>>>> Subject: "Moving Computation is Cheaper than Moving Data"
>>>>>
>>>>> When I read the Hadoop documentation, The Hadoop Distributed File
>>>>> System: Architecture and Design
>>>>> (http://lucene.apache.org/hadoop/hdfs_design.html),
>>>>> a paragraph held my attention:
>>>>>
>>>>> "Moving Computation is Cheaper than Moving Data"
>>>>>
>>>>> A computation requested by an application is much more efficient if
>>>>> it is executed near the data it operates on. This is especially
>>>>> true when the size of the data set is huge. This minimizes network
>>>>> congestion and increases the overall throughput of the system. The
>>>>> assumption is that it is often better to migrate the computation
>>>>> closer to where the data is located rather than moving the data to
>>>>> where the application is running. HDFS provides interfaces for
>>>>> applications to move themselves closer to where the data is located.
>>>>>
>>>>> I'd like to know how to perform that, especially with the aim of
>>>>> distributed Lucene search. Which Hadoop classes should I use to do
>>>>> that?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Samuel
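As for Samuel's question, one literal reading of his plan, with each map task searching one index shard, might look like the sketch below. It combines the classic org.apache.hadoop.mapred interfaces with current Lucene classes (the Lucene API of 2007 differs), and it assumes, purely for illustration, that each input line carries a tab-separated query term and a shard directory readable from the local filesystem, and that documents were indexed with a "content" field. A reduce phase would then merge the per-shard hits by score, much like the Nutch-style merge sketched earlier.

// Illustrative sketch only: assumes input records of the form
// "<queryTerm>\t<localShardDir>", one (query, shard) pair per record,
// with the shard directory readable from the local filesystem.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ShardSearchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split("\t");
        String queryTerm = fields[0];
        String shardDir = fields[1];

        // Open this map's shard locally and run the query against it.
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get(shardDir)))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("content", queryTerm)), 10);
            // Emit (query, shard/docId/score); the reduce phase merges
            // the per-shard hits into a single ranked list.
            for (ScoreDoc sd : hits.scoreDocs) {
                output.collect(new Text(queryTerm),
                        new Text(shardDir + "\t" + sd.doc + "\t" + sd.score));
            }
        }
    }
}

Worth noting: a map-reduce job pays job-startup cost on every query, so this formulation suits batch query workloads; for interactive search, long-lived search servers per shard, as in the Nutch setup described at the top of this thread, are the usual answer.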
