Byron Miller wrote:
On optimizing performance, does anyone know whether Google exports its entire dataset as an index, or somehow indexes only the top N% (since they show only the first 1000 or so results anyway)?
Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes, the entire dataset is searched. In general there can be multiple levels, e.g., high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category and the vast majority of queries resolved by consulting only the first category.
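To make the multi-level lookup concrete, here is a rough sketch in Python. It is not Nutch or Lucene code: the in-memory index layout, the names search_index and tiered_search, and the assumption that the tiers hold disjoint sets of pages are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory index: term -> list of (doc_id, page_score) postings.
Index = Dict[str, List[Tuple[str, float]]]


def search_index(index: Index, term: str) -> List[Tuple[str, float]]:
    """Return all postings for a term in one index (empty if the term is absent)."""
    return index.get(term, [])


def tiered_search(tiers: List[Index], term: str,
                  wanted: int = 1000) -> Tuple[List[Tuple[str, float]], int]:
    """Search tiers in order, highest-scoring tier first.

    Stops as soon as `wanted` hits have been collected, so most queries
    never touch the large low-scoring tiers. Returns the hits (best page
    score first) and the number of tiers that were consulted.
    """
    hits: List[Tuple[str, float]] = []
    consulted = 0
    for tier in tiers:
        consulted += 1
        hits.extend(search_index(tier, term))
        if len(hits) >= wanted:
            break
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:wanted], consulted
```

With a high-scoring tier and a low-scoring tier, a query that finds enough matches in the first tier returns with only one tier consulted; otherwise the search falls through to the next tier.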
What I have implemented so far for Nutch is a single-index version of this. It is a proof of concept: the current index-sorting implementation does not yet scale well to indexes larger than ~50M URLs.
A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.
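A rough sketch of that partition-and-sort step, again in plain Python rather than real MapReduce or Lucene code. The score thresholds, the function names, and the use of strings as stand-ins for documents are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Stand-in for the output of the extra pass: (page_score, document) pairs,
# i.e. the score is the key and the document is the value.
ScoredDoc = Tuple[float, str]


def partition_for(score: float, boundaries: List[float]) -> int:
    """Pick a partition for a score; partition 0 holds the highest scores.

    `boundaries` are hypothetical lower bounds in descending order,
    e.g. [0.9, 0.4] gives three partitions: >= 0.9, >= 0.4, and the rest.
    """
    for i, lower_bound in enumerate(boundaries):
        if score >= lower_bound:
            return i
    return len(boundaries)


def build_score_partitioned_indexes(docs: List[ScoredDoc],
                                    boundaries: List[float]) -> List[List[ScoredDoc]]:
    """Partition documents by score, then sort each partition by descending score.

    Each returned list stands in for one index; list 0 is the
    high-score index that would be searched first.
    """
    partitions: Dict[int, List[ScoredDoc]] = defaultdict(list)
    for score, doc in docs:
        partitions[partition_for(score, boundaries)].append((score, doc))
    return [
        sorted(partitions[p], key=lambda kv: kv[0], reverse=True)
        for p in range(len(boundaries) + 1)
    ]
```

In a real job the partitioning would be done by the framework's partitioner and the sort by the shuffle, with each reduce task writing one index; the sketch only shows the data flow.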
Doug