[Nutch-dev] IndexOptimizer (Re: Lucene performance bottlenecks)

Andrzej Bialecki Mon, 12 Dec 2005 08:34:06 -0800

Doug Cutting wrote:

The IndexOptimizer.java class in the searcher package was an old attempt to create something like what Suel calls "fancy postings". It creates an index with the top 10% scoring postings. Since documents are not renumbered one can intermix postings from this with the full index. So for example, one can first try searching using this index for terms that occur more than, e.g., 10k times, and use the full index for rarer words. If that does not find 1000 hits then the full index must be searched. Such an approach can be combined with using a pre-sorted index.
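The tiered lookup described in the quoted paragraph could be sketched roughly as follows. This is only an illustration under stated assumptions, not the IndexOptimizer code: the TieredSearchSketch class, the PostingsSource interface and the fromMap helper are hypothetical, queries are treated as plain AND queries over doc ids, and the thresholds are passed in as parameters (10k occurrences and 1000 hits being the example values mentioned above).

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types exist in Nutch or Lucene.
class TieredSearchSketch {

    /** Hypothetical view of one index: per-term document frequency and doc ids. */
    interface PostingsSource {
        int docFreq(String term);
        Set<Integer> postings(String term);
    }

    /** Wraps an in-memory term -> doc ids map, for illustration only. */
    static PostingsSource fromMap(Map<String, Set<Integer>> index) {
        return new PostingsSource() {
            public int docFreq(String term) { return postings(term).size(); }
            public Set<Integer> postings(String term) {
                return index.getOrDefault(term, Collections.emptySet());
            }
        };
    }

    /**
     * First pass: frequent terms read the pruned index, rarer terms the full one.
     * This only works because both indexes share the same doc ids.
     * If the first pass yields fewer than minHits, everything is redone on the full index.
     */
    static Set<Integer> search(List<String> terms, PostingsSource pruned, PostingsSource full,
                               int frequentThreshold, int minHits) {
        Set<Integer> hits = intersect(terms, pruned, full, frequentThreshold);
        if (hits.size() < minHits) {
            hits = intersect(terms, full, full, frequentThreshold);
        }
        return hits;
    }

    /** ANDs the postings of all terms, picking the source for each term by its frequency. */
    private static Set<Integer> intersect(List<String> terms, PostingsSource forFrequentTerms,
                                          PostingsSource full, int frequentThreshold) {
        Set<Integer> result = null;
        for (String term : terms) {
            boolean frequent = full.docFreq(term) > frequentThreshold;
            Set<Integer> docs = (frequent ? forFrequentTerms : full).postings(term);
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

The fallback is what keeps the scheme from silently losing hits: a query that comes up short on the pruned postings pays the full cost anyway, so the gain depends on how often the first pass is enough.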

I tested the IndexOptimizer, comparing the result lists from the original and the optimized index.

The trick in the original IndexOptimizer to avoid copying field data doesn't work anymore - it throws exceptions during segment merging. I "fixed it" by commenting out the overridden numDocs() and maxDoc() in OptimizingReader.

Then, after analyzing the explanations, I came to the conclusion that the IDFs are calculated based on the original ratios of docFreq/numDocs, so I needed to modify Similarity.idf() to account for the changed docFreq/numDocs (by FRACTION).
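To make the IDF issue concrete, here is a minimal sketch of one possible adjustment. It assumes the formula used by Lucene's DefaultSimilarity at the time, idf = ln(numDocs / (docFreq + 1)) + 1, and it assumes that numDocs stays at its original value while each docFreq in the optimized index shrinks to roughly FRACTION times the original. The exact change made to Similarity.idf() is not described in this message, so the IdfFractionSketch class and the idfForPrunedIndex method are hypothetical.

```java
// Hypothetical sketch: standalone arithmetic, not the actual Similarity patch.
class IdfFractionSketch {

    /** idf as computed by Lucene's DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * Assumed adjustment: scale the pruned docFreq back up by 1 / fraction
     * before applying the same formula, so the docFreq/numDocs ratio
     * resembles the one in the original index.
     */
    static float idfForPrunedIndex(int prunedDocFreq, int numDocs, double fraction) {
        double estimatedDocFreq = prunedDocFreq / fraction;
        return (float) (Math.log(numDocs / (estimatedDocFreq + 1.0)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        // A term with 10,000 postings originally, 1,000 of which survive at fraction 0.1.
        System.out.println("original index: " + idf(10_000, numDocs));
        System.out.println("pruned, raw:    " + idf(1_000, numDocs));
        System.out.println("pruned, scaled: " + idfForPrunedIndex(1_000, numDocs, 0.1));
    }
}
```

Without the scaling, every term in the optimized index looks rarer than it really is, which inflates its IDF compared with the original index.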

The results, speed-wise, were very encouraging - however, after comparing the hit lists I discovered that they differed significantly.

For single term queries (in Nutch - in Lucene they are rewritten to complex BooleanQueries), the hit lists are nearly identical for the first 10 hits, then they start to differ more and more as you progress along the original hit list. This is not so surprising - after all, this "optimization" operation is lossy. Still, the differences are larger than what was reported in that paper by Suel (but they used a different algorithm to select the postings) - Suel et al. were able to achieve 98% accuracy for the top-10 results, _including_ multi-term boolean queries.
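One simple way to put a number on how far two hit lists diverge is the overlap of their top-k results. The sketch below is an assumption about how such a comparison could be done. It is not the test script used here, and it is not necessarily the accuracy measure used by Suel et al. The HitListOverlapSketch class and the overlapAtK method are hypothetical, and hits are represented as plain doc ids.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: compares two ranked lists of doc ids by top-k overlap.
class HitListOverlapSketch {

    /**
     * Fraction of the original top-k doc ids that also appear in the
     * optimized top-k. Order within the top-k is ignored.
     * Returns 1.0 when the original list is empty or k is not positive (nothing to miss).
     */
    static double overlapAtK(List<Integer> original, List<Integer> optimized, int k) {
        int n = Math.max(0, Math.min(k, original.size()));
        if (n == 0) {
            return 1.0;
        }
        int m = Math.max(0, Math.min(k, optimized.size()));
        Set<Integer> optimizedTop = new HashSet<>(optimized.subList(0, m));
        int shared = 0;
        for (Integer doc : original.subList(0, n)) {
            if (optimizedTop.contains(doc)) {
                shared++;
            }
        }
        return shared / (double) n;
    }
}
```

Computing this for increasing k (10, 50, 100, ...) would show where the two lists start to drift apart. Since it ignores ordering, a rank-aware measure would be needed to capture hits that are present but badly misplaced.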

For multi-term Nutch queries, which are rewritten to a combination of boolean queries and sloppy phrase queries, the effects are disastrous - I could barely manage to get some of the matching hits within the first 300 results, and their order was completely at odds with the original hit list. This is probably due to the scoring of sloppy phrases - I need to modify the test scripts to compare the explanations from matching results...


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




