--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
> > How best is it to segment your indices? Just split
> > things and set up a huge query farm that hopefully can
> > handle the load?
>
> That's the intended method today.
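
For reference, the "split things and query the whole farm" method can be sketched roughly as below. This is a minimal illustration, not Nutch code: the class name, the `Hit` type, and `mergeTopK` are all invented here, and it assumes the scores reported by different servers are directly comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Nutch code: every server searches its own
// slice of the collection, and a front end merges the partial hit lists.
class BroadcastMergeSketch {

    // One hit as a server might report it; the field names are made up.
    static final class Hit {
        final String url;
        final double score;

        Hit(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Combine the per-server lists and keep the k highest-scoring hits.
    // Assumes scores from different servers can be compared directly.
    static List<Hit> mergeTopK(List<List<Hit>> perServer, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perServer) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Double.compare(b.score, a.score)); // best first
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> server1 = Arrays.asList(new Hit("http://a.example/", 0.9),
                                          new Hit("http://b.example/", 0.2));
        List<Hit> server2 = Arrays.asList(new Hit("http://c.example/", 0.5));
        for (Hit h : mergeTopK(Arrays.asList(server1, server2), 2)) {
            System.out.println(h.score + " " + h.url);
        }
    }
}
```

Since every query touches every server in this layout, adding servers shrinks the slice each one has to search but does not reduce the number of servers involved per query.
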

As part of my re-work of servers.txt into an XML file, I'm trying to
think of ways to map some type of design onto what each server does.
This way I can manage servers.txt/xml to define WebDB servers as well
as index servers (and the data mapping/directory binding for each).

> > Try and break things up based on the PR of data, and then
> > as queries happen you have beefy high PR servers and
> > scale down?
>
> It might be interesting to try something like this. In theory this
> could provide efficiencies, but Nutch does not yet support it.
>
> A related approach is to still distribute to the full set of indexes,
> but to sort postings in each index by link analysis score. This makes
> each node in the distributed system faster, so that fewer nodes are
> needed. This would use something like the approach proposed by Torsten
> Suel in http://cis.poly.edu/suel/papers/order.pdf. I started to
> implement a variant of this in IndexOptimizer.java, which divides an
> index into two buckets, the high scoring and the rest. I have an
> undebugged implementation of the search side of this that has not yet
> been committed. Someday I hope to have a chance to finish this...

That is an awesome idea! Could this be made part of index merging, so
that a merge applies it automatically, while still being callable
directly as its own procedure? Something like this would also be nice
to design a cache interface into. If you can cache 90% of your high PR
data and use the rest for hit/miss/reload of low PR queries, you could
really reduce server loads/requirements and tune the systems very
efficiently. (Versus having to cache entire index segments just to
catch a high ranking doc at the last record.)

> > How about sorting your data based on terms/words/data?
>
> It is more complicated to do things this way, and it doesn't, in the
> long haul, scale as well. Inktomi used to do this. I have no idea
> whether they still do.
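
Going back to the two-bucket split mentioned earlier in this message, here is a rough sketch of how the search side might look. It is not the code in IndexOptimizer.java, which is not shown in this thread. Every name below is invented, term matching is reduced to a substring test, and ranking is assumed to be by link score alone. Under that assumption the low bucket can be skipped whenever the high bucket already yields k matches. With query-dependent scoring the shortcut would only be approximate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a two-bucket index; this is not the code in
// IndexOptimizer.java, and all names here are invented for illustration.
class TwoBucketSketch {

    static final class Doc {
        final String url;
        final double linkScore; // stand-in for a link analysis score
        final String text;

        Doc(String url, double linkScore, String text) {
            this.url = url;
            this.linkScore = linkScore;
            this.text = text;
        }
    }

    final List<Doc> high = new ArrayList<>(); // high-scoring documents
    final List<Doc> rest = new ArrayList<>(); // everything else

    // Documents scoring at or above the cutoff go into the high bucket.
    TwoBucketSketch(List<Doc> docs, double cutoff) {
        for (Doc d : docs) {
            (d.linkScore >= cutoff ? high : rest).add(d);
        }
    }

    // Search the high bucket first and only fall back to the rest when
    // it yields fewer than k matches. Results are ranked by link score.
    List<Doc> search(String term, int k) {
        List<Doc> hits = matches(high, term);
        if (hits.size() < k) {
            hits.addAll(matches(rest, term));
        }
        hits.sort((a, b) -> Double.compare(b.linkScore, a.linkScore));
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }

    // Toy matching: a document matches if its text contains the term.
    private static List<Doc> matches(List<Doc> bucket, String term) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : bucket) {
            if (d.text.contains(term)) {
                out.add(d);
            }
        }
        return out;
    }
}
```

The caching idea above maps onto this naturally: the `high` bucket is the part worth keeping in memory, while `rest` is only read on the fallback path.
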

After a quick thought process I realized what a PITA that would be as
well :)

> > Anyone have any clue on how Yahoo/Google or any other
> > major search system manages the query load, indices,
> > updating of data and keeps a fast response time?
>
> In Google's published reports, they appear to do approximately what
> Nutch does: broadcast the query to a large number of servers, each of
> which searches a subset of the collection.

From my experience with building a large corpus, it is managing this
subset that concerns me. I'm thinking of building an "instance"
configuration file that could define fetcher runs, index sizes and
such, to give a uniform create, analyze, generate, fetch, merge,
index, analyze, generate, fetch, merge process that can be
monitored/managed.

> The intended update design for Nutch is to keep an offline copy of all
> of the indexes. New segments can be added there, and old segments can
> be removed. Duplicate detection and subsequent merging can be performed
> here. Once a new set of merged indexes is constructed, it can be copied
> to production machines. If you perform duplicate detection after
> merging, then your indexes will be slightly larger, but you'll only have
> to fully update those production machines whose segments are being
> replaced with new segments. Those machines which are still serving the
> same segments can just get a new copy of the Lucene deletions file. But
> if you instead perform duplicate detection before merging, then your
> indexes will be smaller, speeding search somewhat, but you'll have to
> update all production search machines. I hope this makes sense.

Makes sense, and it's another reason I need a uniform mapping and
allocation scheme. Would using a distributed fs to distribute the
deleted URLs work, or would something be out of phase?

> Management software to automate all of this is of course needed.

Amen to that :)

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
