Great reading and great ideas. In such a system with, say, three segment partitions, would it be possible to build a MapReduce job to efficiently fetch, retrieve, and update those segments? Use a map job to process a segment for deletion, and process that segment to create a new fetchlist, so that you only fetch URLs that aren't already covered, either because they were already fetched or because they are duplicated in another segment that hasn't aged out. Just brainstorming :)
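Something like the following is what I have in mind, as a rough, framework-free sketch of just the map/reduce logic. SegmentEntry, its fields, and the aged-out flag are made-up stand-ins for illustration, not actual Nutch classes:

import java.util.*;

// Hypothetical sketch of the dedup/fetchlist idea above.
// SegmentEntry and its fields are illustrative assumptions.
public class FetchlistDedup {

  static class SegmentEntry {
    String url;
    long fetchTime;     // 0 if never fetched
    boolean agedOut;    // true if the owning segment has expired
    SegmentEntry(String url, long fetchTime, boolean agedOut) {
      this.url = url; this.fetchTime = fetchTime; this.agedOut = agedOut;
    }
  }

  // Map + shuffle, modeled in memory: key every entry by URL so
  // duplicates from all segments group together.
  static Map<String, List<SegmentEntry>> map(List<SegmentEntry> entries) {
    Map<String, List<SegmentEntry>> grouped = new HashMap<>();
    for (SegmentEntry e : entries)
      grouped.computeIfAbsent(e.url, k -> new ArrayList<>()).add(e);
    return grouped;
  }

  // Reduce: emit a URL for the new fetchlist only if no live
  // (non-aged-out) segment already holds a fetched copy.
  static List<String> reduce(Map<String, List<SegmentEntry>> grouped) {
    List<String> fetchlist = new ArrayList<>();
    for (Map.Entry<String, List<SegmentEntry>> kv : grouped.entrySet()) {
      boolean alreadyFetched = false;
      for (SegmentEntry e : kv.getValue())
        if (!e.agedOut && e.fetchTime > 0) { alreadyFetched = true; break; }
      if (!alreadyFetched) fetchlist.add(kv.getKey());
    }
    return fetchlist;
  }
}

The real job would of course run the map and reduce functions under the MapReduce runtime with URLs as keys, so duplicates from all three segments meet at the same reducer.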
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Doug Cutting wrote:
> > Byron Miller wrote:
> >> On optimizing performance, does anyone know if google is
> >> exporting its entire dataset as an index or only somehow indexing
> >> the topN % (since they only show the first 1000 or so results
> >> anyway)
> >
> > Both. The highest-scoring pages are kept in separate indexes that
> > are searched first. When a query fails to match 1000 or so
> > documents in the high-scoring indexes, the entire dataset is
> > searched. In general there can be multiple levels, e.g.
> > high-scoring, mid-scoring, and low-scoring indexes, with the vast
> > majority of pages in the last category and the vast majority of
> > queries resolved by consulting only the first category.
> >
> > What I have implemented so far for Nutch is a single-index version
> > of this. The current index-sorting implementation does not yet
> > scale well to indexes larger than ~50M urls. It is a
> > proof-of-concept.
> >
> > A better long-term approach is to introduce another MapReduce pass
> > that collects Lucene documents (or equivalent) as values and page
> > scores as keys. Then the indexing MapReduce pass can partition and
> > sort by score before creating indexes. The distributed search code
> > will also need to be modified to search high-score indexes first.
>
> The WWW2005 conference presented a couple of interesting papers on
> the subject (http://www2005.org), among others these:
>
> 1. http://www2005.org/cdrom/docs/p235.pdf
> 2. http://www2005.org/cdrom/docs/p245.pdf
> 3. http://www2005.org/cdrom/docs/p257.pdf
>
> The techniques described in the first paper are not too difficult to
> implement, especially Carmel's method of index pruning, which gives
> satisfactory results at moderate cost.
>
> The third paper, by Long & Suel, presents the concept of using a
> cache of intersections for multi-term queries, which we already sort
> of use with CachingFilters, only they propose to store the
> intersections on disk instead of limiting the cache to a relatively
> small number of filters kept in RAM.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
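Coming back to the tiered indexes Doug describes above: the fallback could look roughly like this. Tier is a made-up interface standing in for a per-tier index searcher, and the hit handling is simplified:

import java.util.*;

// Hedged sketch of tiered-search fallback: query the high-scoring
// index first, consult lower tiers only when the hit count falls
// short of what the query needs.
public class TieredSearch {

  interface Tier { List<String> search(String query, int limit); }

  static List<String> search(List<Tier> tiers, String query, int wanted) {
    List<String> hits = new ArrayList<>();
    for (Tier tier : tiers) {              // ordered high- to low-scoring
      hits.addAll(tier.search(query, wanted - hits.size()));
      if (hits.size() >= wanted) break;    // most queries stop at tier 0
    }
    return hits;
  }
}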
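And the score-keyed indexing pass he outlines might partition documents along these lines. ScoredDoc and the single tier boundary are assumptions for illustration; in the real pass the MapReduce framework would do the sorting, with the page score as the key:

import java.util.*;

// Sketch of score-keyed partitioning: sort documents best-first,
// then cut them into a high-scoring tier and a low-scoring tier.
public class ScoreSortedIndexing {

  static class ScoredDoc implements Comparable<ScoredDoc> {
    float score; String doc;
    ScoredDoc(float score, String doc) { this.score = score; this.doc = doc; }
    public int compareTo(ScoredDoc o) { return Float.compare(o.score, score); }
  }

  // The first tierSize docs go to the high-scoring index,
  // the remainder to the low tier.
  static List<List<String>> partition(List<ScoredDoc> docs, int tierSize) {
    Collections.sort(docs);                        // descending by score
    List<String> high = new ArrayList<>(), low = new ArrayList<>();
    for (ScoredDoc d : docs)
      (high.size() < tierSize ? high : low).add(d.doc);
    return Arrays.asList(high, low);
  }
}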
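On the index pruning Andrzej mentions, a much-simplified sketch in the spirit of the Carmel paper would keep only the k highest-scoring postings per term (the paper's actual method uses score thresholds, so take this as the crudest variant). Posting and its score field are illustrative:

import java.util.*;

// Crude posting-list pruning: retain the k best postings per term,
// drop the rest to shrink the index.
public class IndexPruning {

  static class Posting {
    int docId; float score;
    Posting(int docId, float score) { this.docId = docId; this.score = score; }
  }

  static Map<String, List<Posting>> prune(Map<String, List<Posting>> index,
                                          int k) {
    Map<String, List<Posting>> pruned = new HashMap<>();
    for (Map.Entry<String, List<Posting>> term : index.entrySet()) {
      List<Posting> postings = new ArrayList<>(term.getValue());
      postings.sort((a, b) -> Float.compare(b.score, a.score)); // best first
      pruned.put(term.getKey(),
                 new ArrayList<>(postings.subList(0, Math.min(k, postings.size()))));
    }
    return pruned;
  }
}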
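And for the Long & Suel intersection cache, the core idea reduces to memoizing the document set for a term combination. The on-disk store they propose is stubbed here as an in-memory map, and the BitSet postings are an assumption:

import java.util.*;

// Tiny sketch of an intersection cache: memoize the docId set for a
// multi-term query so repeated queries skip the posting-list
// intersection entirely.
public class IntersectionCache {

  private final Map<String, BitSet> cache = new HashMap<>();

  // Terms are sorted so "foo bar" and "bar foo" share one entry.
  BitSet intersect(List<String> terms, Map<String, BitSet> postings) {
    List<String> key = new ArrayList<>(terms);
    Collections.sort(key);
    return cache.computeIfAbsent(String.join(" ", key), k -> {
      BitSet result = null;
      for (String t : terms) {
        BitSet docs = postings.getOrDefault(t, new BitSet());
        if (result == null) result = (BitSet) docs.clone();
        else result.and(docs);
      }
      return result == null ? new BitSet() : result;
    });
  }
}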
