Re: map/reduce and Lucene integration question

Ted Dunning Thu, 13 Dec 2007 11:32:28 -0800


After indexing, the indexes are copied to multiple query servers.  Each query
server keeps its indexes on local disk.

There are two dimensions to scaling search.  The first dimension is query
rate.  To get that scaling, you simply replicate your basic search operator
and balance using a simple load balancer.
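
As a rough sketch of that first dimension, assuming every replica holds an
identical copy of the index, round-robin balancing can be as small as the
following.  The class name RoundRobinBalancer and its pick method are made up
for illustration; an off-the-shelf load balancer would normally do this job
rather than application code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin balancer over identical replicas of one index.
// T stands for whatever handle you use to reach a query server.
class RoundRobinBalancer<T> {
    private final List<T> replicas;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<T> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica");
        }
        this.replicas = List.copyOf(replicas);
    }

    // Each call returns the next replica in turn, wrapping around at the end.
    // floorMod keeps the index non-negative even after the counter overflows.
    T pick() {
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }
}
```

Adding a replica to the list raises the query rate the farm can absorb
without changing how any single search is executed.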

The second dimension is collection size.  If you have more than about 20
million documents, you need to have several machines cooperate in a search.
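To scale in this dimension, you split the collection across several farms of
query servers and put front end engines in front of them.  Each front end runs
a multi-search against all of the farms, and each farm scales in the first
dimension using load balancing.  You need load balancing in front of your
front end engines as well.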

With this architecture, you get good scaling in both queries per second and
collection size and you maintain full HA.
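
The front end's multi-search step can be sketched roughly as below.  This is
an illustration only, not Lucene or Nutch API: the names ScatterGather, Shard
and Hit are hypothetical, each Shard stands for one load-balanced farm holding
one slice of the collection, and the sketch assumes scores from different
slices are directly comparable, which is not automatically true when each
slice computes its own term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical front end that fans a query out to every slice of the
// collection and merges the partial results.
class ScatterGather {
    // One scored hit returned by one slice.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // A slice of the collection; in practice a farm behind a load balancer.
    interface Shard {
        List<Hit> search(String query, int k);
    }

    // Ask every shard for its top k hits, then keep the global top k.
    // Shards are queried one after another to keep the sketch short;
    // a real front end would issue these requests in parallel.
    static List<Hit> search(List<Shard> shards, String query, int k) {
        List<Hit> all = new ArrayList<>();
        for (Shard shard : shards) {
            all.addAll(shard.search(query, k));
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Because every shard must answer every query, adding shards grows the
collection size you can serve, while adding replicas inside each farm grows
the query rate.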

On 12/13/07 11:18 AM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote:

> On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
>> 
>> I don't think so (but I don't run nutch)
>> 
>> To actually run searches, the search engines copy the index to local
>> storage.  Having them in HDFS is very nice, however, as a way to move them
>> to the right place.
> 
> Even with an extremely fast network connection between nodes, moving
> indexes of several gigabytes seems to be very slow.
> 
> Is there any way to guarantee that a request is sent to the data node
> that already holds the required part of the index, or to guarantee that all
> reduce jobs run on the same host so that the index ends up on that
> host?
> 
> I feel that map/reduce is a perfect way to index a large set of documents,
> but I'm not sure how searching would be performed later. I can imagine the
> search request being broadcast to ALL nodes, with each node taking the
> request, performing a search and returning (or not) results that are
> reduced later. As far as I can see, however, Hadoop will send the request to
> the first node that seems to be free, which is not necessarily the node that
> holds the index suitable for this request?
