Doug Cutting wrote:

The IndexOptimizer.java class in the searcher package was an old attempt to create something like what Suel calls "fancy postings". It creates an index with the top 10% scoring postings. Since documents are not renumbered one can intermix postings from this with the full index. So for example, one can first try searching using this index for terms that occur more than, e.g., 10k times, and use the full index for rarer words. If that does not find 1000 hits then the full index must be searched. Such an approach can be combined with using a pre-sorted index.
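For reference, the lookup strategy described above could be sketched roughly as follows. This is only an illustration, not code from Nutch or Lucene: the class, the method names and the two search functions are hypothetical, and the 10k and 1000 figures are simply the example values from the text.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in Nutch or Lucene.
class TieredSearch {
    /** Frequent terms are read from the pruned index, rarer ones from the full index. */
    static boolean usePrunedPostings(int docFreq, int frequentTermThreshold) {
        return docFreq > frequentTermThreshold;
    }

    /** Try the pruned index first; repeat the search on the full index if it yields too few hits. */
    static <H> List<H> search(String query,
                              Function<String, List<H>> prunedSearch,
                              Function<String, List<H>> fullSearch,
                              int minHits) {
        List<H> hits = prunedSearch.apply(query);
        return hits.size() >= minHits ? hits : fullSearch.apply(query);
    }
}
```

With the values from the text, usePrunedPostings(docFreq, 10000) picks the posting source per term, and search(..., 1000) falls back to the full index whenever the pruned index returns fewer than 1000 hits.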


I tested the IndexOptimizer, comparing the result lists from the original and the optimized index.

The trick in the original IndexOptimizer to avoid copying field data doesn't work anymore - it throws exceptions during segment merging. I "fixed it" by commenting out the overridden numDocs() and maxDoc() methods in OptimizingReader.

Then, after analyzing the explanations, I came to the conclusion that the IDFs are calculated based on the original ratios of docFreq/numDocs, so I needed to modify Similarity.idf() to account for the changed docFreq/numDocs ratio (changed by FRACTION).
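A minimal sketch of one way such a correction could look is below. It assumes that the pruned index keeps roughly FRACTION of each term's postings while numDocs stays at its original value, so the original docFreq is estimated as prunedDocFreq / FRACTION. That assumption, and the class and method names, are illustrative only. The idf() formula is the ln(numDocs / (docFreq + 1)) + 1 form used by Lucene's DefaultSimilarity.

```java
// Hypothetical sketch; this is not the actual change made to Similarity.idf().
class FractionAwareIdf {
    /** The ln(numDocs / (docFreq + 1)) + 1 form used by Lucene's DefaultSimilarity. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * Assumption: the pruned index retains about `fraction` of each term's
     * postings, so the original docFreq is estimated as prunedDocFreq / fraction.
     */
    static float correctedIdf(int prunedDocFreq, int numDocs, float fraction) {
        if (fraction <= 0f || fraction > 1f) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        int estimatedDocFreq = Math.min(numDocs, Math.round(prunedDocFreq / fraction));
        return idf(estimatedDocFreq, numDocs);
    }
}
```

Without the correction, a term looks rarer in the pruned index than it really is, so its IDF comes out too high.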

The results, speed-wise, were very encouraging - however, after comparing the hit lists I discovered that they differed significantly.

For single-term queries (in Nutch - in Lucene they are rewritten to complex BooleanQueries), the hit lists are nearly identical for the first 10 hits, then they differ more and more as you progress along the original hit list. This is not so surprising - after all, this "optimization" operation is lossy. Still, the differences are larger than those reported in the paper by Suel et al. (though they used a different algorithm to select the postings) - they were able to achieve 98% accuracy for the top-10 results, _including_ multi-term boolean queries.
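One simple way to put a number on such differences is the fraction of the original top-k hits that also appear in the top k of the optimized list. The sketch below is only an illustration of that measure, not the test script used here and not necessarily the accuracy measure used in the paper. The names are hypothetical, hits are represented as plain document ID strings, and each list is assumed to contain no duplicates.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a top-k overlap measure between two hit lists.
class HitListOverlap {
    /** Fraction of the original top-k document IDs that also occur in the optimized top k. */
    static double overlapAtK(List<String> original, List<String> optimized, int k) {
        int n = Math.min(k, original.size());
        if (n == 0) {
            return 1.0; // nothing to find, so nothing was lost
        }
        Set<String> originalTop = new HashSet<>(original.subList(0, n));
        int shared = 0;
        for (String docId : optimized.subList(0, Math.min(k, optimized.size()))) {
            if (originalTop.contains(docId)) {
                shared++;
            }
        }
        return shared / (double) n;
    }
}
```

This measure ignores ordering within the top k, so a rank-sensitive comparison would be needed to capture hits that are present but badly reordered.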

For multi-term Nutch queries, which are rewritten to a combination of boolean queries and sloppy phrase queries, the effects are disastrous - I could barely manage to get some of the matching hits within the first 300 results, and their order was completely at odds with the original hit list. This is probably due to the scoring of sloppy phrases - I need to modify the test scripts to compare the explanations from matching results...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
