On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
>
> I don't think so (but I don't run nutch)
>
> To actually run searches, the search engines copy the index to local
> storage. Having them in HDFS is very nice, however, as a way to move them
> to the right place.
Even with an extremely fast network connection between nodes, moving indexes that are several gigabytes in size seems very slow. Is there any way to guarantee that a request goes to the particular data node which already holds the required part of the index, or to guarantee that all reduce tasks run on the same host, so that the index ends up located there?

I feel that map/reduce is a perfect way to index a large set of documents, but I'm not sure how the searching would be performed later. I can imagine the search request being broadcast to ALL nodes, with each node taking the request, searching its local part of the index and returning (or not) results, which are then reduced into a final list. However, as far as I can see, Hadoop will send the request to the first node that appears to be free - not necessarily the node which holds the index suitable for this request.
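In case it helps to make the idea concrete, below is a rough sketch of the broadcast-and-merge pattern I am describing. The Shard, Hit and searchLocal names are placeholders I made up for illustration - they are not real Nutch or Hadoop APIs - and the real thing would still have to route the query to the nodes that actually hold the index parts, which is exactly the guarantee I am asking about.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch only: broadcast a query to every node holding an index shard,
    // let each node search its local part, then merge the partial results.
    public class ScatterGatherSearch {

        // One result with its score, as returned by a single shard.
        public static class Hit {
            final String docId;
            final float score;
            Hit(String docId, float score) { this.docId = docId; this.score = score; }
        }

        // Stand-in for a remote search node holding one part of the index;
        // in practice this would be an RPC client talking to that node.
        public interface Shard {
            List<Hit> searchLocal(String query, int limit);
        }

        private final List<Shard> shards;
        private final ExecutorService pool;

        public ScatterGatherSearch(List<Shard> shards) {
            this.shards = shards;
            this.pool = Executors.newFixedThreadPool(shards.size());
        }

        // Broadcast the query to every shard, gather the partial hit lists
        // and keep the top-k results overall, ordered by score.
        public List<Hit> search(String query, int topK) throws InterruptedException {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard shard : shards) {
                tasks.add(() -> shard.searchLocal(query, topK));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> future : pool.invokeAll(tasks)) {
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // In this sketch a failed node simply contributes no hits.
                }
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return merged.subList(0, Math.min(topK, merged.size()));
        }
    }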
-- 
Eugene N Dzhurinsky