Hi,
I'm happy to report that further tests on a larger index show that the
overall impact of the IndexSorter is positive: the performance
improvements are significant, and the overall quality of results seems
at least comparable, if not actually better.
The reason why result quality seems better is quite interesting, and it
shows that the simple top-N measures that I was using in my benchmarks
may have been too simplistic.
Using the original index, it was possible for pages with high tf/idf of
a term, but with a low "boost" value (the OPIC score), to outrank pages
with high "boost" but lower tf/idf of a term. This phenomenon quite
often leads to results that are perceived as "junk", e.g. pages with a
lot of repeated terms but little other real content, such as navigation
bars.
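
To make the effect concrete, here is a toy calculation. It assumes a
simplified model in which the final score is just the tf/idf part
multiplied by the "boost"; the class name and all numbers are made up
for illustration and are not taken from my benchmarks.

```java
public class BoostVsTfIdfSketch {

    // Assumed simplified scoring model: final score = tf/idf part * boost.
    public static double score(double tfIdf, double boost) {
        return tfIdf * boost;
    }

    public static void main(String[] args) {
        // Hypothetical values, chosen only to illustrate the effect.
        double junk = score(8.0, 0.5); // many repeated terms, low boost
        double good = score(1.5, 2.0); // real content, high boost

        // The "junk" page wins on final score despite its low boost.
        System.out.println("junk=" + junk + " good=" + good);
        System.out.println("junk outranks good: " + (junk > good));
    }
}
```

With these numbers the "junk" page scores 4.0 against 3.0 for the
content page, so it ends up higher in a plain top-N list.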
Using the optimized index, I reported previously that some of the
top-scoring results were missing. As it happens, the missing results
were typically the "junk" pages with high tf/idf but low "boost". Since
we collect up to N hits, going from higher to lower "boost" values, the
"junk" pages with low "boost" value were automatically eliminated. So,
overall the subjective quality of results was improved. On the other
hand, some of the legitimate results with decent "boost" values were
also skipped because they didn't fit within the fixed number of hits...
ah, well. Perhaps we should limit the number of hits in LimitedCollector
using a cutoff "boost" value, and not the maximum number of hits (or
maybe both?).
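
A rough sketch of the "maybe both" variant follows. It is not the
actual LimitedCollector code and does not use the Lucene API; it only
assumes that matching documents arrive in index order, sorted by
descending "boost". The class name, method name and parameters
(maxHits, minBoost) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BoostCutoffCollectorSketch {

    /**
     * Walks matching documents in index order (assumed to be sorted by
     * descending boost) and returns the positions of the collected hits.
     * Collection stops at the first of two limits: the hit cap, or a
     * boost value that falls below the cutoff.
     */
    public static List<Integer> collect(double[] boostsDescending,
                                        int maxHits, double minBoost) {
        List<Integer> collected = new ArrayList<Integer>();
        for (int doc = 0; doc < boostsDescending.length; doc++) {
            if (collected.size() >= maxHits) {
                break; // hard cap on the number of hits
            }
            if (boostsDescending[doc] < minBoost) {
                break; // all remaining documents have even lower boost
            }
            collected.add(doc);
        }
        return collected;
    }

    public static void main(String[] args) {
        // Hypothetical boost values of matching documents, in index order.
        double[] boosts = {5.0, 3.2, 2.1, 0.9, 0.4, 0.1};
        System.out.println(collect(boosts, 10, 1.0)); // prints [0, 1, 2]
        System.out.println(collect(boosts, 2, 1.0));  // prints [0, 1]
    }
}
```

Because the index is sorted, the first document below the cutoff ends
the scan, so the cutoff keeps the early-termination benefit while the
cap still bounds the worst case.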
This again brings to attention the importance of the OPIC score: it
represents a query-independent opinion about the quality of the page -
whichever way you calculate it. If you use PageRank, it (allegedly)
corresponds to other people's opinions about the page, thus providing an
"objective" quality opinion. If you use a simple list of
white/black-listed sites that you like/dislike, then it represents your
own subjective opinion on the quality of the site; etc, etc... In this
way, running a search engine that provides "good" results is not just a
matter of precision, recall, tf/idf and other tangible measures, it's
also a sort of political statement by the engine's operator. ;-)
To conclude, I will add IndexSorter.java to the core classes, and I
suggest that we continue the experiments ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers