[Nutch-dev] IndexSorter optimizer

Andrzej Bialecki Wed, 21 Dec 2005 05:16:04 -0800

Hi,

I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

The reason why result quality seems better is quite interesting, and it shows that the simple top-N measures that I was using in my benchmarks may have been too simplistic.

Using the original index, it was possible for pages with a high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with a high "boost" but a lower tf/idf of that term. This phenomenon quite often leads to results that are perceived as "junk", e.g. pages with a lot of repeated terms but little other real content, such as navigation bars.
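To make the effect concrete, here is a minimal sketch in plain Java. It assumes, purely for illustration, that the final score is simply the tf/idf-like similarity multiplied by the document "boost"; the class name BoostedScore and the numbers are hypothetical and are not taken from the Nutch code.

```java
// Hypothetical illustration, not Nutch code: assumes the final score is
// the tf/idf-like similarity multiplied by the per-document boost.
final class BoostedScore {

    private BoostedScore() {
    }

    /** Combines a query-dependent similarity with a query-independent boost. */
    static double score(double similarity, double boost) {
        return similarity * boost;
    }

    /** True if the first page would be ranked above the second one. */
    static boolean outranks(double simA, double boostA, double simB, double boostB) {
        return score(simA, boostA) > score(simB, boostB);
    }
}
```

With these made-up numbers, a navigation-bar style page with similarity 8.0 and boost 0.1 outranks a page with similarity 1.5 and boost 0.5, even though the second page has five times the boost.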

Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the "junk" pages with high tf/idf but low "boost". Since we collect up to N hits, going from higher to lower "boost" values, the "junk" pages with a low "boost" value were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some of the legitimate results with decent "boost" values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff "boost" value, and not the maximum number of hits (or maybe both?).
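The "or maybe both?" variant could look roughly like the sketch below. It is plain Java rather than the actual LimitedCollector, and it assumes that hits arrive in descending "boost" order (as they would from a sorted index) and that the final score is similarity times boost. The names SortedHit, BoostLimitedCollector, maxHits and minBoost are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hit record: document id, query-independent boost,
// and query-dependent (tf/idf-like) similarity.
final class SortedHit {
    final int doc;
    final float boost;
    final float similarity;

    SortedHit(int doc, float boost, float similarity) {
        this.doc = doc;
        this.boost = boost;
        this.similarity = similarity;
    }

    // Assumption: the final score is the product of similarity and boost.
    float score() {
        return similarity * boost;
    }
}

// Sketch only, not the real LimitedCollector.
final class BoostLimitedCollector {

    private BoostLimitedCollector() {
    }

    /**
     * Walks hits that are already ordered by descending boost and stops as
     * soon as either limit is reached: maxHits collected, or boost < minBoost.
     * The collected hits are then ranked by their final score.
     */
    static List<SortedHit> collect(List<SortedHit> hitsByBoostDesc, int maxHits, float minBoost) {
        List<SortedHit> collected = new ArrayList<>();
        for (SortedHit hit : hitsByBoostDesc) {
            if (collected.size() >= maxHits || hit.boost < minBoost) {
                break; // everything after this point has an equal or lower boost
            }
            collected.add(hit);
        }
        // Highest final score first.
        collected.sort((a, b) -> Float.compare(b.score(), a.score()));
        return collected;
    }
}
```

Passing a minBoost of 0 reduces this to the current fixed-N behaviour, and passing a very large maxHits leaves only the "boost" cutoff, so both alternatives can be tried with the same code path.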

This again brings to attention the importance of the OPIC score: it represents a query-independent opinion about the quality of the page - whichever way you calculate it. If you use PageRank, it (allegedly) corresponds to other people's opinions about the page, thus providing an "objective" quality opinion. If you use a simple list of white/black-listed sites that you like/dislike, then it represents your own subjective opinion on the quality of the site; etc, etc... In this way, running a search engine that provides "good" results is not just a matter of precision, recall, tf/idf and other tangible measures, it's also a sort of political statement of the engine's operator. ;-)

To conclude, I will add the IndexSorter.java to the core classes, and I suggest we continue the experiments ...


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



