Doug Cutting wrote:
Byron Miller wrote:
On optimizing performance, does anyone know if Google
exports its entire dataset as an index, or somehow
indexes only the top N% (since they only show the
first 1000 or so results anyway)?
Both. The highest-scoring pages are kept in separate indexes that are
searched first. When a query fails to match 1000 or so documents in
the high-scoring indexes then the entire dataset is searched. In
general there can be multiple levels, e.g.: high-scoring, mid-scoring
and low-scoring indexes, with the vast majority of pages in the last
category, and the vast majority of queries resolved by consulting only
the first category.
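The tiered fallback described above can be sketched in a few lines of Java. This is a toy illustration, not Nutch code: the tier maps, the query term, and the ENOUGH_HITS threshold are all stand-ins (the real cutoff would be the ~1000 hits mentioned above).

```java
import java.util.*;

// Toy sketch of tiered search: query the high-scoring tier first and
// fall back to lower tiers only when too few hits were found.
public class TieredSearch {
    static final int ENOUGH_HITS = 3; // stand-in for the ~1000 threshold

    // Each tier maps a term to the doc ids containing it; the first
    // tier holds the highest-scoring pages.
    static final List<Map<String, List<Integer>>> TIERS = List.of(
        Map.of("nutch", List.of(1, 2)),          // high-scoring tier
        Map.of("nutch", List.of(3, 4, 5, 6, 7))  // low-scoring tier
    );

    static List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> tier : TIERS) {
            hits.addAll(tier.getOrDefault(term, List.of()));
            if (hits.size() >= ENOUGH_HITS)
                break; // enough hits: later tiers are never touched
        }
        return hits;
    }

    public static void main(String[] args) {
        // the high tier alone yields only 2 hits, so the search falls
        // through to the low tier as well
        System.out.println(search("nutch"));
    }
}
```

With a larger high-scoring tier (or a popular term), the loop exits after the first tier and the bulk of the index is never read.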
What I have implemented so far for Nutch is a single-index version of
this. The current index-sorting implementation does not yet scale
well to indexes larger than ~50M URLs; it is a proof of concept.
A better long-term approach is to introduce another MapReduce pass
that collects Lucene documents (or equivalent) as values, and page
scores as keys. Then the indexing MapReduce pass can partition and
sort by score before creating indexes. The distributed search code
will also need to be modified to search high-score indexes first.
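As a rough illustration of that proposed pass (the class and method names here are mine, and the real Hadoop/MapReduce plumbing is omitted): collect (score, document) pairs, sort by descending score, then cut the sorted run into per-tier partitions, the first of which would feed the high-scoring index.

```java
import java.util.*;

// Illustrative sketch, not the actual Nutch implementation: page
// scores act as sort keys, and the sorted run is split into tiers.
public class ScorePartition {
    record ScoredDoc(float score, String url) {}

    // Split docs into `tiers` partitions of descending score; the
    // first partition would feed the high-scoring index.
    static List<List<String>> partition(List<ScoredDoc> docs, int tiers) {
        List<ScoredDoc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> Float.compare(b.score(), a.score()));
        List<List<String>> out = new ArrayList<>();
        int per = (int) Math.ceil(sorted.size() / (double) tiers);
        for (int i = 0; i < sorted.size(); i += per) {
            List<String> part = new ArrayList<>();
            for (ScoredDoc d : sorted.subList(i, Math.min(i + per, sorted.size())))
                part.add(d.url());
            out.add(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScoredDoc> docs = List.of(
            new ScoredDoc(0.9f, "a"), new ScoredDoc(0.1f, "b"),
            new ScoredDoc(0.5f, "c"), new ScoredDoc(0.3f, "d"));
        System.out.println(partition(docs, 2)); // [[a, c], [d, b]]
    }
}
```

In a real MapReduce job the sort and partition would of course be done by the framework, with a score-aware Partitioner in place of the manual split above.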
The WWW2005 conference (http://www2005.org) presented a couple of
interesting papers on the subject, including these:
1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf
The techniques described in the first paper are not too difficult to
implement, especially Carmel's method of index pruning, which gives
satisfactory results at a moderate cost.
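To give the flavor of that family of techniques, here is a generic top-k pruning sketch under my own naming; it is not Carmel's actual formulation (which prunes by score thresholds and comes with quality guarantees). The idea: keep only the k highest-impact postings of each term and drop the long, low-scoring tail.

```java
import java.util.*;
import java.util.stream.Collectors;

// Generic sketch of static index pruning: for each term, retain only
// the k postings with the highest impact scores.
public class IndexPruning {
    // postings: doc id -> impact score of this term in that doc
    static Map<Integer, Float> pruneTopK(Map<Integer, Float> postings, int k) {
        return postings.entrySet().stream()
            .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
            .limit(k)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<Integer, Float> postings =
            Map.of(1, 0.9f, 2, 0.1f, 3, 0.5f, 4, 0.05f);
        // only the two highest-scoring docs (1 and 3) survive
        System.out.println(pruneTopK(postings, 2).keySet());
    }
}
```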
The third paper, by Long & Suel, presents the concept of caching
posting-list intersections for multi-term queries, which we already do
in a limited way with CachingFilters; they propose storing the
intersections on disk instead of limiting the cache to a relatively
small number of filters kept in RAM...
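A minimal sketch of the idea (the in-memory map below merely stands in for the on-disk store Long & Suel propose; all names are mine): memoize the intersection of two terms' posting lists under an order-independent key, so repeated multi-term queries skip the merge.

```java
import java.util.*;

// Sketch of an intersection cache for two-term queries: the result of
// intersecting two posting lists is computed once, then reused.
public class IntersectionCache {
    // in-memory stand-in for an on-disk intersection store
    static final Map<String, Set<Integer>> CACHE = new HashMap<>();

    static Set<Integer> intersect(String t1, Set<Integer> p1,
                                  String t2, Set<Integer> p2) {
        // order-independent cache key for the term pair
        String key = t1.compareTo(t2) < 0 ? t1 + "&" + t2 : t2 + "&" + t1;
        return CACHE.computeIfAbsent(key, k -> {
            Set<Integer> r = new TreeSet<>(p1); // computed only on a miss
            r.retainAll(p2);
            return r;
        });
    }

    public static void main(String[] args) {
        Set<Integer> a = Set.of(1, 2, 3, 5), b = Set.of(2, 3, 4);
        System.out.println(intersect("web", a, "search", b)); // [2, 3]
        // the same pair in either order now hits the cache
        System.out.println(intersect("search", b, "web", a)); // [2, 3]
    }
}
```

An on-disk variant would swap the HashMap for a persistent key-value store and add an eviction policy, but the lookup structure is the same.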
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers