Re: [Nutch-general] How to go about building a large scale system

Byron Miller Wed, 16 Jun 2004 12:17:37 -0700

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
> > How best is it to segment your indices? Just split
> > things and set up a huge query farm that hopefully can
> > handle the load?
> 
> That's the intended method today.


As part of my re-work of servers.txt into an XML file,
I'm trying to think of ways to map some kind of
design onto what each server does.

This way I can use servers.txt/xml to define WebDB
servers as well as index servers (and the data
mapping/directory binding for each).
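
A rough sketch of what such a role-aware server file could look like,
and how it might be read. The element names, attribute names, hosts,
and paths below are all made up for illustration; none of this is an
existing Nutch format.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: every element and attribute name here is an
# assumption for illustration, not an existing Nutch file format.
SERVERS_XML = """
<servers>
  <server host="db1.example.com" port="9000" role="webdb">
    <data dir="/data/webdb"/>
  </server>
  <server host="idx1.example.com" port="9001" role="index">
    <data dir="/data/segments/a"/>
    <data dir="/data/segments/b"/>
  </server>
  <server host="idx2.example.com" port="9001" role="index">
    <data dir="/data/segments/c"/>
  </server>
</servers>
"""

def load_servers(xml_text):
    """Group servers by role; each entry keeps host, port, and data dirs."""
    by_role = {}
    for node in ET.fromstring(xml_text).findall("server"):
        entry = {
            "host": node.get("host"),
            "port": int(node.get("port")),
            "dirs": [d.get("dir") for d in node.findall("data")],
        }
        by_role.setdefault(node.get("role"), []).append(entry)
    return by_role

servers = load_servers(SERVERS_XML)
for role, entries in servers.items():
    print(role, [e["host"] for e in entries])
```

Grouping by role keeps the WebDB machines and the index machines in one
file while still letting each tool pick out only the servers it cares
about.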

> 
> > Try and break things up based on the
> > PR of data and then, as queries come in, you have
> > beefy high-PR servers and scale down?
> 
> It might be interesting to try something like this.  In theory this
> could provide efficiencies, but Nutch does not yet support it.
>
> A related approach is to still distribute to the full set of indexes,
> but to sort postings in each index by link analysis score.  This makes
> each node in the distributed system faster, so that fewer nodes are
> needed.  This would use something like the approach proposed by Torsten
> Suel in http://cis.poly.edu/suel/papers/order.pdf.  I started to
> implement a variant of this in IndexOptimizer.java, which divides an
> index into two buckets, the high scoring and the rest.  I have an
> undebugged implementation of the search side of this that has not yet
> been committed.  Someday I hope to have a chance to finish this...

That is an awesome idea!  Could this be built into the
index merge, so that merging applies it automatically,
while still being callable directly as its own procedure?

Something like this would be nice to design a cache
interface around.  If you can cache 90% of your high-PR
data and use the rest for hit/miss/reload of low-PR
queries, you could really reduce server loads/requirements
and tune the systems really efficiently (versus having to
cache entire index segments to catch a high-ranking
doc at the last record).
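
To make the two-bucket idea concrete, here is a minimal sketch. It is
not what IndexOptimizer.java does; it assumes, for simplicity, that
results are ranked purely by a static link-analysis score, and the
function names and sample documents are invented. Under that
assumption, every match in the high bucket outranks every match in the
rest, so the second bucket only has to be read when the first one
cannot fill the result page.

```python
def build_buckets(docs, threshold):
    """Split docs into (high, rest) by link score; each list is best-first."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    high = [d for d in ranked if d["score"] >= threshold]
    rest = [d for d in ranked if d["score"] < threshold]
    return high, rest

def matches(doc, term):
    """Toy matcher: whole-word match on whitespace-split text."""
    return term in doc["text"].split()

def search(high, rest, term, k):
    """Return up to k matching doc ids and whether the rest bucket was read."""
    hits = [d["id"] for d in high if matches(d, term)][:k]
    if len(hits) >= k:
        return hits, False  # high bucket alone filled the page
    hits += [d["id"] for d in rest if matches(d, term)][: k - len(hits)]
    return hits, True

# Invented sample data.
docs = [
    {"id": "a", "score": 0.9, "text": "nutch search engine"},
    {"id": "b", "score": 0.8, "text": "lucene search library"},
    {"id": "c", "score": 0.2, "text": "search tips"},
    {"id": "d", "score": 0.1, "text": "nutch crawl notes"},
]
high, rest = build_buckets(docs, threshold=0.5)
print(search(high, rest, "search", 2))
print(search(high, rest, "nutch", 2))
```

A cache would then only need to pin the high bucket in memory and treat
the rest as the hit/miss/reload tier.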

> > How about sorting your data based on terms/words/data?
>
> It is more complicated to do things this way, and it doesn't, in the
> long haul, scale as well.  Inktomi used to do this.  I have no idea
> whether they still do.

After a quick thought process I realized what a PITA
that would be as well :)

> > Anyone have any clue how Yahoo/Google or any other
> > major search system manages the query load, indices,
> > and updating of data, and keeps a fast response time?
> 
> In Google's published reports, they appear to do approximately what
> Nutch does: broadcast the query to a large number of servers, each of
> which searches a subset of the collection.
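
The broadcast scheme described above can be sketched as a scatter-gather
over in-memory "shards". This is only an illustration under simplifying
assumptions: each shard is a plain dict, scores are precomputed, and
threads stand in for remote search servers. All names are invented and
none of this is Nutch code.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Local top-k for one shard as (score, doc_id) pairs, best first."""
    hits = [(score, doc_id) for doc_id, (score, text) in shard.items()
            if term in text.split()]
    return heapq.nlargest(k, hits)

def broadcast_search(shards, term, k):
    """Send the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, term, k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)

# Invented sample data: each shard maps doc_id -> (score, text).
shards = [
    {"a": (0.9, "nutch search"), "b": (0.3, "search tips")},
    {"c": (0.7, "distributed search"), "d": (0.5, "nutch crawl")},
]
print(broadcast_search(shards, "search", 2))
```

Each shard returns only its own top k, so the merge step handles at most
k hits per shard no matter how large the subsets are.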

I guess from my experience with building a large
corpus, it is managing this subset that concerns me.
I'm thinking of building an "instance" configuration
file that could define fetcher runs, index sizes and
such, to create a more uniform create, analyze,
generate, fetch, merge, index, analyze, generate,
fetch, merge process that can be monitored/managed.

> The intended update design for Nutch is to keep an offline copy of all
> of the indexes.  New segments can be added there, and old segments can
> be removed.  Duplicate detection and subsequent merging can be performed
> here.  Once a new set of merged indexes is constructed, it can be copied
> to production machines.  If you perform duplicate detection after
> merging, then your indexes will be slightly larger, but you'll only have
> to fully update those production machines whose segments are being
> replaced with new segments.  Those machines which are still serving the
> same segments can just get a new copy of the Lucene deletions file.  But
> if you instead perform duplicate detection before merging, then your
> indexes will be smaller, speeding search somewhat, but you'll have to
> update all production search machines.  I hope this makes sense.

Makes sense; another reason I need a uniform mapping
and allocation scheme.
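
The trade-off in the quoted update design can be written down as a small
planning function. This is a sketch of the decision only, with invented
machine and segment names; it does not copy any files and is not part of
Nutch.

```python
def plan_update(assignments, replaced, dedup_before_merge):
    """Map each machine to 'full' (new index copy) or 'deletions'
    (only a fresh copy of the deletions file)."""
    replaced = set(replaced)
    plan = {}
    for machine, segments in assignments.items():
        if dedup_before_merge or replaced & set(segments):
            # Dedup before merging changes every index, and a machine
            # holding a replaced segment needs new data either way.
            plan[machine] = "full"
        else:
            plan[machine] = "deletions"
    return plan

# Invented mapping of production machines to the segments they serve.
assignments = {"s1": ["seg1", "seg2"], "s2": ["seg3"], "s3": ["seg4"]}

print(plan_update(assignments, ["seg3"], dedup_before_merge=False))
print(plan_update(assignments, ["seg3"], dedup_before_merge=True))
```

A machine-to-segment mapping like `assignments` is exactly the kind of
information a uniform allocation scheme would have to provide.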

Would using a distributed fs to allocate the deleted
urls work or would something be out of phase?

> 
> Management software to automate all of this is of
> course needed.
> 

Amen to that :)

