Great reading and great ideas. In such a system, where you have say 3 segment partitions, is it possible to build a MapReduce job to efficiently fetch, retrieve and update those segments?

For example: use a map job to process a segment for deletion, and then process that segment to create a new fetchlist, so that you only fetch data that isn't already covered, i.e. already fetched, or duplicated in another segment that hasn't aged out yet. Just brainstorming :)
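A very rough sketch of what that pass might look like, written against a Hadoop-style mapred API purely for illustration (the class names, the "FETCHED" status string, and the shape of the key/value pairs are placeholders of mine, not actual Nutch code):

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  /** Sketch: merge all live (non-aged-out) segments and emit a fetchlist
   *  containing only URLs that aren't already fetched somewhere. */
  public class DedupFetchlistSketch {

    /** Map: pass through (url, fetchStatus) pairs read from each segment. */
    public static class SegmentMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      public void map(Text url, Text fetchStatus,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        out.collect(url, fetchStatus);
      }
    }

    /** Reduce: a URL goes onto the new fetchlist only if no live segment
     *  already holds a fetched copy of it. */
    public static class FetchlistReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text url, Iterator<Text> statuses,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        boolean alreadyFetched = false;
        while (statuses.hasNext()) {
          if ("FETCHED".equals(statuses.next().toString())) {
            alreadyFetched = true;   // some live segment already has it
          }
        }
        if (!alreadyFetched) {
          out.collect(url, new Text("UNFETCHED"));  // candidate for (re)fetching
        }
      }
    }
  }

Segments that have aged out simply wouldn't be listed as job inputs, so their URLs would naturally become eligible for refetching again.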
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Doug Cutting wrote:
>
> > Byron Miller wrote:
> >
> >> On optimizing performance, does anyone know if google
> >> is exporting its entire dataset as an index or only
> >> somehow indexing the topN % (since they only show the
> >> first 1000 or so results anyway)
> >
> > Both. The highest-scoring pages are kept in separate indexes that are
> > searched first. When a query fails to match 1000 or so documents in
> > the high-scoring indexes then the entire dataset is searched. In
> > general there can be multiple levels, e.g.: high-scoring, mid-scoring
> > and low-scoring indexes, with the vast majority of pages in the last
> > category, and the vast majority of queries resolved consulting only
> > the first category.
> >
> > What I have implemented so far for Nutch is a single-index version of
> > this. The current index-sorting implementation does not yet scale
> > well to indexes larger than ~50M urls. It is a proof-of-concept.
> >
> > A better long-term approach is to introduce another MapReduce pass
> > that collects Lucene documents (or equivalent) as values, and page
> > scores as keys. Then the indexing MapReduce pass can partition and
> > sort by score before creating indexes. The distributed search code
> > will also need to be modified to search high-score indexes first.
>
> The WWW2005 conference presented a couple of interesting papers on the
> subject (http://www2005.org), among others these:
>
> 1. http://www2005.org/cdrom/docs/p235.pdf
> 2. http://www2005.org/cdrom/docs/p245.pdf
> 3. http://www2005.org/cdrom/docs/p257.pdf
>
> The techniques described in the first paper are not too difficult to
> implement, especially Carmel's method of index pruning, which gives
> satisfactory results at moderate cost.
>
> The third paper, by Long & Suel, presents a concept of using a cache of
> intersections for multi-term queries, which we already sort of use with
> CachingFilters, only they propose to store them on-disk instead of
> limiting the cache to a relatively small number of filters kept in RAM...
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web | Embedded Unix, System Integration
> http://www.sigram.com   Contact: info at sigram dot com
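Following up below the quote: to make the tiered-search fallback Doug describes a bit more concrete, a minimal sketch could look like this. The IndexTier interface, the class names and the ~1000-hit threshold are placeholders of mine, not actual Nutch or Lucene code.

  import java.util.Collections;
  import java.util.List;

  /** Hypothetical view of one score-sorted index tier (e.g. high-scoring pages). */
  interface IndexTier {
    /** Return up to maxHits matching document ids, best-scoring first. */
    List<String> search(String query, int maxHits);
  }

  /** Search the small, high-scoring tiers first and only fall back to the
   *  larger, lower-scoring tiers when a tier cannot produce enough hits. */
  class TieredSearcher {
    private final List<IndexTier> tiers;  // ordered: high-scoring first, full index last
    private final int enoughHits;         // e.g. ~1000, as in Doug's description

    TieredSearcher(List<IndexTier> tiers, int enoughHits) {
      this.tiers = tiers;
      this.enoughHits = enoughHits;
    }

    List<String> search(String query) {
      List<String> hits = Collections.emptyList();
      for (IndexTier tier : tiers) {
        hits = tier.search(query, enoughHits);
        if (hits.size() >= enoughHits) {
          return hits;   // the small, high-scoring tier satisfied the query
        }
        // not enough matches here: descend to the next (larger) tier
      }
      return hits;       // result from the last tier (the full dataset)
    }
  }

With the tiers ordered high-scoring first and the full dataset last, most queries would be answered by the small first tier and never touch the tier holding the bulk of the pages.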