Great reading and great ideas.

In such a system, where you have say 3 segment
partitions, is it possible to build a MapReduce job to
efficiently fetch, retrieve and update these segments?

Use a map job to process a segment for deletion, and at
the same time derive a new fetchlist from that segment,
so that you only fetch URLs that aren't already covered,
either because they were already fetched or because they
are duplicated in another segment that hasn't aged out
(rough sketch below).

just brainstorming :) 
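
To make that concrete, here's roughly what I'm picturing, in a
Hadoop-style MapReduce API. Pure sketch: SegmentRecord, the class
names, and everything else here are invented, not anything that
exists in Nutch.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class FetchlistFromSegments {

    // Hypothetical value type: whatever a segment reader would hand
    // us per URL, i.e. fetch status plus whether its segment expired.
    public static class SegmentRecord implements Writable {
      boolean fetched;   // was this URL actually fetched here?
      boolean agedOut;   // has this segment aged out?
      public void write(DataOutput out) throws IOException {
        out.writeBoolean(fetched);
        out.writeBoolean(agedOut);
      }
      public void readFields(DataInput in) throws IOException {
        fetched = in.readBoolean();
        agedOut = in.readBoolean();
      }
    }

    // Map over all segments at once: emit url -> 1 if a live segment
    // already holds a fetched copy, url -> 0 if it was only scheduled.
    public static class SegmentMapper extends MapReduceBase
        implements Mapper<Text, SegmentRecord, Text, IntWritable> {
      public void map(Text url, SegmentRecord rec,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        int fetchedInLiveSegment = (rec.fetched && !rec.agedOut) ? 1 : 0;
        out.collect(url, new IntWritable(fetchedInLiveSegment));
      }
    }

    // Reduce: all copies of a URL meet here; put it on the new
    // fetchlist only if no live segment already fetched it.
    public static class FetchlistReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text url, Iterator<IntWritable> statuses,
                         OutputCollector<Text, IntWritable> out,
                         Reporter reporter) throws IOException {
        while (statuses.hasNext()) {
          if (statuses.next().get() == 1) {
            return;  // already fetched somewhere, skip the duplicate
          }
        }
        out.collect(url, new IntWritable(0));  // not covered: refetch
      }
    }
  }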



--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Doug Cutting wrote:
> 
> > Byron Miller wrote:
> >
> >> On optimizing performance, does anyone know if Google is exporting
> >> its entire dataset as an index, or only somehow indexing the top N%
> >> (since they only show the first 1000 or so results anyway)?
> >
> > Both.  The highest-scoring pages are kept in separate indexes that
> > are searched first.  When a query fails to match 1000 or so documents
> > in the high-scoring indexes, then the entire dataset is searched.  In
> > general there can be multiple levels, e.g. high-scoring, mid-scoring
> > and low-scoring indexes, with the vast majority of pages in the last
> > category, and the vast majority of queries resolved by consulting
> > only the first category.
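
(Just to check my understanding of the fall-through, is it something
like this?  Placeholder types throughout, not the real search code:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  // stand-in for one score-tier index
  interface Tier {
    List<Integer> search(String query, int limit) throws IOException;
  }

  class TieredSearcher {
    static final int WANTED = 1000;  // the "1000 or so" above

    List<Integer> search(String query, Tier[] tiers) throws IOException {
      List<Integer> hits = new ArrayList<Integer>();
      for (Tier tier : tiers) {            // high-, mid-, low-scoring
        hits.addAll(tier.search(query, WANTED - hits.size()));
        if (hits.size() >= WANTED) break;  // most queries stop at tier 0
      }
      return hits;
    }
  }

So most queries never touch the big low-scoring index at all.)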
> >
> > What I have implemented so far for Nutch is a single-index version
> > of this.  The current index-sorting implementation does not yet
> > scale well to indexes larger than ~50M URLs.  It is a
> > proof-of-concept.
> >
> > A better long-term approach is to introduce another MapReduce pass
> > that collects Lucene documents (or equivalent) as values, and page
> > scores as keys.  Then the indexing MapReduce pass can partition and
> > sort by score before creating indexes.  The distributed search code
> > will also need to be modified to search high-score indexes first.
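
(So the page score becomes the map output key.  I imagine the
partitioning step looking something like the sketch below; the
thresholds, the three-reduce assumption, and the Text stand-in for
the serialized document are all invented:

  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Routes each (score, document) pair to a score band, assuming the
  // job runs with 3 reduces; each reduce then builds one index, so
  // you get separate high-, mid- and low-scoring indexes in one pass.
  public class ScoreBandPartitioner
      implements Partitioner<FloatWritable, Text> {

    public int getPartition(FloatWritable score, Text doc,
                            int numPartitions) {
      float s = score.get();
      if (s >= 0.9f) return 0;  // small high-scoring index
      if (s >= 0.5f) return 1;  // mid-scoring index
      return 2;                 // low tier: the vast majority of pages
    }

    public void configure(JobConf job) {}  // nothing to configure
  }

Sorting descending by score within each partition would also need a
custom key comparator, since the default float ordering is ascending.)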
> 
> 
> The WWW2005 conference presented a couple of interesting papers on
> the subject (http://www2005.org), among others these:
> 
> 1. http://www2005.org/cdrom/docs/p235.pdf
> 2. http://www2005.org/cdrom/docs/p245.pdf
> 3. http://www2005.org/cdrom/docs/p257.pdf
> 
> The techniques described in the first paper are not too difficult to
> implement, especially Carmel's method of index pruning, which gives
> satisfactory results at moderate cost.
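
(For reference, my reading of Carmel's top-k pruning is: for each
term, keep only the postings that score within some fraction of that
term's k-th best posting.  A toy sketch of that rule, not code from
the paper:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  class Posting {
    int doc;
    float score;  // the term's precomputed score for this document
  }

  class TermPruner {
    // Keep a posting only if score >= epsilon * (k-th highest score).
    static List<Posting> prune(List<Posting> postings, int k,
                               float epsilon) {
      if (postings.size() <= k) return postings;  // nothing to prune
      List<Posting> sorted = new ArrayList<Posting>(postings);
      Collections.sort(sorted, new Comparator<Posting>() {
        public int compare(Posting a, Posting b) {
          return Float.compare(b.score, a.score);  // descending
        }
      });
      float threshold = epsilon * sorted.get(k - 1).score;
      List<Posting> kept = new ArrayList<Posting>();
      for (Posting p : postings) {
        if (p.score >= threshold) kept.add(p);
      }
      return kept;
    }
  }

With epsilon < 1 this keeps at least the top k postings per term.)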
> 
> The third paper, by Long & Suel, presents the concept of using a
> cache of intersections for multi-term queries, which we already sort
> of use with CachingFilters, only they propose to store them on disk
> instead of limiting the cache to a relatively small number of
> filters kept in RAM...
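
(I imagine the on-disk version as something like this toy, with one
serialized doc-id set per unordered term pair.  All names invented,
nothing from CachingFilters or the paper:

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.util.BitSet;

  class IntersectionCache {
    private final File dir;
    IntersectionCache(File dir) { this.dir = dir; }

    // canonical file per term pair (toy naming; assumes the terms
    // are safe to use in filenames)
    private File fileFor(String t1, String t2) {
      String a = t1.compareTo(t2) <= 0 ? t1 : t2;
      String b = t1.compareTo(t2) <= 0 ? t2 : t1;
      return new File(dir, a + "_" + b + ".bits");
    }

    BitSet get(String t1, String t2)
        throws IOException, ClassNotFoundException {
      File f = fileFor(t1, t2);
      if (!f.exists()) return null;  // miss: intersect, then put()
      ObjectInputStream in =
          new ObjectInputStream(new FileInputStream(f));
      try {
        return (BitSet) in.readObject();
      } finally {
        in.close();
      }
    }

    void put(String t1, String t2, BitSet docs) throws IOException {
      ObjectOutputStream out =
          new ObjectOutputStream(new FileOutputStream(fileFor(t1, t2)));
      try {
        out.writeObject(docs);
      } finally {
        out.close();
      }
    }
  }
)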
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
