Re: [Nutch-general] How to go about building a large scale system

Doug Cutting Mon, 14 Jun 2004 16:32:04 -0700

Byron Miller wrote:

How best is it to segment your indices? Just split
things and set up a huge query farm that hopefully can
handle the load?


That's the intended method today.

Or try to break things up based on the
PR of the data, so that as queries come in you have
beefy high-PR servers and scale down from there?

It might be interesting to try something like this. In theory this could provide efficiencies, but Nutch does not yet support it.

A related approach is to still distribute to the full set of indexes, but to sort postings in each index by link analysis score. This makes each node in the distributed system faster, so that fewer nodes are needed. This would use something like the approach proposed by Torsten Suel in http://cis.poly.edu/suel/papers/order.pdf. I started to implement a variant of this in IndexOptimizer.java, which divides an index into two buckets, the high scoring and the rest. I have an undebugged implementation of the search side of this that has not yet been committed. Someday I hope to have a chance to finish this...
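To make the two-bucket idea concrete, here is a minimal sketch of how a split postings list might let a node stop early. It is not the code in IndexOptimizer.java: the class name, the `threshold` parameter, and the simplification that link score is the only ranking signal are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch code: postings for one term, split into a
// high-scoring bucket and "the rest" by an assumed link-score threshold.
class TwoBucketPostings {
    final List<Integer> high = new ArrayList<>();
    final List<Integer> rest = new ArrayList<>();

    /** Files a document under the bucket chosen by its link analysis score. */
    void add(int docId, double linkScore, double threshold) {
        if (linkScore >= threshold) {
            high.add(docId);
        } else {
            rest.add(docId);
        }
    }

    /** Returns up to `wanted` doc IDs, reading the rest bucket only if the high bucket falls short. */
    List<Integer> collect(int wanted) {
        List<Integer> out = new ArrayList<>();
        for (int id : high) {
            if (out.size() == wanted) return out;
            out.add(id);
        }
        for (int id : rest) {
            if (out.size() == wanted) return out;
            out.add(id);
        }
        return out;
    }
}
```

When the high bucket alone satisfies a query, the rest of the postings are never read, which is where the per-node speedup would come from. A real implementation would also have to combine link score with text relevance.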

How about sorting your data based on terms/words/data?

It is more complicated to do things this way, and it doesn't, in the long haul, scale as well. Inktomi used to do this. I have no idea whether they still do.

Anyone have any clue on how Yahoo/Google or any other major search system manages the query load, indices, updating of data and keeps a fast response time?

In Google's published reports, they appear to do approximately what Nutch does: broadcast the query to a large number of servers, each of which searches a subset of the collection.
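The broadcast pattern can be sketched roughly as follows. This is an illustration only, not Nutch's actual search API: `Shard`, `Hit`, and `BroadcastSearch` are hypothetical names, and the sketch assumes scores from different partitions are directly comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not Nutch's API: fan a query out to every partition
// and merge the partial results into one ranked list.
class BroadcastSearch {

    /** One scored hit returned by one partition. */
    static final class Hit {
        final String url;
        final double score;
        Hit(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /** A server that searches only its own subset of the collection. */
    interface Shard {
        List<Hit> search(String query, int topK);
    }

    /** Sends the query to all shards concurrently, then keeps the topK best hits overall. */
    static List<Hit> search(List<Shard> shards, String query, int topK) {
        List<CompletableFuture<List<Hit>>> pending = new ArrayList<>();
        for (Shard shard : shards) {
            pending.add(CompletableFuture.supplyAsync(() -> shard.search(query, topK)));
        }
        List<Hit> merged = new ArrayList<>();
        for (CompletableFuture<List<Hit>> f : pending) {
            merged.addAll(f.join()); // wait for this shard's partial result
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
    }
}
```

Each shard only needs to return its own top hits, so the merge step stays cheap even with many servers.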

The intended update design for Nutch is to keep an offline copy of all of the indexes. New segments can be added there, and old segments can be removed. Duplicate detection and subsequent merging can also be performed there. Once a new set of merged indexes is constructed, it can be copied to production machines.

If you perform duplicate detection after merging, then your indexes will be slightly larger, but you'll only have to fully update those production machines whose segments are being replaced with new segments. Those machines which are still serving the same segments can just get a new copy of the Lucene deletions file. But if you instead perform duplicate detection before merging, then your indexes will be smaller, speeding search somewhat, but you'll have to update all production search machines. I hope this makes sense.
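For the case where duplicate detection runs after merging, the per-machine decision might look like the sketch below. No such planner exists in Nutch as far as this message says; `UpdatePlanner`, the `Action` names, and the maps from machine name to segment names are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch, not Nutch code: decide what each production machine
// must receive when duplicate detection is performed after merging.
class UpdatePlanner {
    enum Action { FULL_INDEX_COPY, DELETIONS_FILE_ONLY }

    /**
     * current and next map a machine name to the set of segment names it
     * serves before and after the update.
     */
    static Map<String, Action> plan(Map<String, Set<String>> current,
                                    Map<String, Set<String>> next) {
        Map<String, Action> actions = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : next.entrySet()) {
            // Unchanged segments only need a fresh deletions file.
            boolean sameSegments = e.getValue().equals(current.get(e.getKey()));
            actions.put(e.getKey(),
                    sameSegments ? Action.DELETIONS_FILE_ONLY : Action.FULL_INDEX_COPY);
        }
        return actions;
    }
}
```

With duplicate detection before merging, every machine would fall into the full-copy case, so a planner like this would have nothing to decide.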

Management software to automate all of this is of course needed.

Doug
