Doug Cutting wrote:
Byron Miller wrote:
On optimizing performance, does anyone know if Google
exports its entire dataset as an index, or somehow
indexes only the top N% (since they only show the
first 1000 or so results anyway)?
Both. The highest-scoring pages are kept in separate indexes that are
searched first. When a query fails to match 1000 or so documents in
the high-scoring indexes then the entire dataset is searched. In
general there can be multiple levels, e.g.: high-scoring, mid-scoring
and low-scoring indexes, with the vast majority of pages in the last
category, and the vast majority of queries resolved by consulting only
the first category.
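The tiered fallback described above can be sketched in a few lines of Java. This is a toy illustration, not Nutch code: the tier maps, the query term, and the ENOUGH_HITS threshold are all stand-ins (the real cutoff would be the ~1000 hits mentioned above).

```java
import java.util.*;

// Toy sketch of tiered search: query the high-scoring tier first and
// fall back to lower tiers only when too few hits were found.
public class TieredSearch {
    static final int ENOUGH_HITS = 3; // stand-in for the ~1000 threshold

    // Each tier maps a term to the doc ids containing it; the first
    // tier holds the highest-scoring pages.
    static final List<Map<String, List<Integer>>> TIERS = List.of(
        Map.of("nutch", List.of(1, 2)),          // high-scoring tier
        Map.of("nutch", List.of(3, 4, 5, 6, 7))  // low-scoring tier
    );

    static List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> tier : TIERS) {
            hits.addAll(tier.getOrDefault(term, List.of()));
            if (hits.size() >= ENOUGH_HITS)
                break; // enough hits: later tiers are never touched
        }
        return hits;
    }

    public static void main(String[] args) {
        // the high tier alone yields only 2 hits, so the search falls
        // through to the low tier as well
        System.out.println(search("nutch"));
    }
}
```

With a larger high-scoring tier (or a popular term), the loop exits after the first tier and the bulk of the index is never read.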
What I have implemented so far for Nutch is a single-index version of
this. The current index-sorting implementation does not yet scale
well to indexes larger than ~50M URLs; it is a proof of concept.
A better long-term approach is to introduce another MapReduce pass
that collects Lucene documents (or equivalent) as values, and page
scores as keys. Then the indexing MapReduce pass can partition and
sort by score before creating indexes. The distributed search code
will also need to be modified to search high-score indexes first.
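As a rough illustration of that proposed pass (the class and method names here are mine, and the real Hadoop/MapReduce plumbing is omitted): collect (score, document) pairs, sort by descending score, then cut the sorted run into per-tier partitions, the first of which would feed the high-scoring index.

```java
import java.util.*;

// Illustrative sketch, not the actual Nutch implementation: page
// scores act as sort keys, and the sorted run is split into tiers.
public class ScorePartition {
    record ScoredDoc(float score, String url) {}

    // Split docs into `tiers` partitions of descending score; the
    // first partition would feed the high-scoring index.
    static List<List<String>> partition(List<ScoredDoc> docs, int tiers) {
        List<ScoredDoc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> Float.compare(b.score(), a.score()));
        List<List<String>> out = new ArrayList<>();
        int per = (int) Math.ceil(sorted.size() / (double) tiers);
        for (int i = 0; i < sorted.size(); i += per) {
            List<String> part = new ArrayList<>();
            for (ScoredDoc d : sorted.subList(i, Math.min(i + per, sorted.size())))
                part.add(d.url());
            out.add(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScoredDoc> docs = List.of(
            new ScoredDoc(0.9f, "a"), new ScoredDoc(0.1f, "b"),
            new ScoredDoc(0.5f, "c"), new ScoredDoc(0.3f, "d"));
        System.out.println(partition(docs, 2)); // [[a, c], [d, b]]
    }
}
```

In a real MapReduce job the sort and partition would of course be done by the framework, with a score-aware Partitioner in place of the manual split above.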
The WWW2005 conference (http://www2005.org) presented a couple of
interesting papers on the subject, including these:
1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf
The techniques described in the first paper are not too difficult to
implement, especially Carmel's method of index pruning, which gives
satisfactory results at a moderate cost.
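To give the flavor of that family of techniques, here is a generic top-k pruning sketch under my own naming; it is not Carmel's actual formulation (which prunes by score thresholds and comes with quality guarantees). The idea: keep only the k highest-impact postings of each term and drop the long, low-scoring tail.

```java
import java.util.*;
import java.util.stream.Collectors;

// Generic sketch of static index pruning: for each term, retain only
// the k postings with the highest impact scores.
public class IndexPruning {
    // postings: doc id -> impact score of this term in that doc
    static Map<Integer, Float> pruneTopK(Map<Integer, Float> postings, int k) {
        return postings.entrySet().stream()
            .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
            .limit(k)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<Integer, Float> postings =
            Map.of(1, 0.9f, 2, 0.1f, 3, 0.5f, 4, 0.05f);
        // only the two highest-scoring docs (1 and 3) survive
        System.out.println(pruneTopK(postings, 2).keySet());
    }
}
```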
The third paper, by Long & Suel, presents the concept of caching
posting-list intersections for multi-term queries, which we already do
in a limited way with CachingFilters; they propose storing the
intersections on disk instead of limiting the cache to a relatively
small number of filters kept in RAM...
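A minimal sketch of the idea (the in-memory map below merely stands in for the on-disk store Long & Suel propose; all names are mine): memoize the intersection of two terms' posting lists under an order-independent key, so repeated multi-term queries skip the merge.

```java
import java.util.*;

// Sketch of an intersection cache for two-term queries: the result of
// intersecting two posting lists is computed once, then reused.
public class IntersectionCache {
    // in-memory stand-in for an on-disk intersection store
    static final Map<String, Set<Integer>> CACHE = new HashMap<>();

    static Set<Integer> intersect(String t1, Set<Integer> p1,
                                  String t2, Set<Integer> p2) {
        // order-independent cache key for the term pair
        String key = t1.compareTo(t2) < 0 ? t1 + "&" + t2 : t2 + "&" + t1;
        return CACHE.computeIfAbsent(key, k -> {
            Set<Integer> r = new TreeSet<>(p1); // computed only on a miss
            r.retainAll(p2);
            return r;
        });
    }

    public static void main(String[] args) {
        Set<Integer> a = Set.of(1, 2, 3, 5), b = Set.of(2, 3, 4);
        System.out.println(intersect("web", a, "search", b)); // [2, 3]
        // the same pair in either order now hits the cache
        System.out.println(intersect("search", b, "web", a)); // [2, 3]
    }
}
```

An on-disk variant would swap the HashMap for a persistent key-value store and add an eviction policy, but the lookup structure is the same.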
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers