Hi Andrzej,

Wow, this is really great news!
Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the "junk" pages with high tf/idf but low "boost". Since we collect up to N hits, going from higher to lower "boost" values, the "junk" pages with low "boost" values were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some of the legitimate results with decent "boost" values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff "boost" value, and not the maximum number of hits (or maybe both?).

As long as we are experimenting, it would be good to have both.
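
To make the "both limits" idea concrete, here is a rough sketch of how I imagine it. It is not the actual LimitedCollector code: the class name BoostLimitedCollector, its methods, and the parameters maxHits and minBoost are made up for illustration, and it assumes hits arrive in descending "boost" order (as they would from the sorted index) and treats every array entry as a matching hit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not the real Nutch LimitedCollector.
class BoostLimitedCollector {
    private final int maxHits;     // stop after collecting this many hits
    private final float minBoost;  // stop once a hit's boost drops below this
    private final List<Integer> collected = new ArrayList<>();

    BoostLimitedCollector(int maxHits, float minBoost) {
        this.maxHits = maxHits;
        this.minBoost = minBoost;
    }

    /** Offers one hit; returns false as soon as either limit is reached. */
    boolean collect(int docId, float boost) {
        if (collected.size() >= maxHits || boost < minBoost) {
            return false;
        }
        collected.add(docId);
        return true;
    }

    List<Integer> hits() {
        return collected;
    }

    /**
     * Walks hits in index order. Assumes boostsByDocId is sorted by
     * descending boost, so we can stop at the first rejected hit.
     */
    static List<Integer> run(float[] boostsByDocId, int maxHits, float minBoost) {
        BoostLimitedCollector collector = new BoostLimitedCollector(maxHits, minBoost);
        for (int doc = 0; doc < boostsByDocId.length; doc++) {
            if (!collector.collect(doc, boostsByDocId[doc])) {
                break;
            }
        }
        return collector.hits();
    }
}
```

With a generous maxHits the boost cutoff decides where we stop, and with a cutoff of 0 the behaviour falls back to the plain "up to N hits" limit, so we could experiment with either setting or both at once.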

To conclude, I will add the IndexSorter.java to the core classes, and I suggest to continue the experiments ...

Maybe someone out there in the community has a commercial search engine running (e.g. a Google appliance or similar), so we could set up a Nutch instance with the same pages and compare the results. I guess it will be difficult to compare Nutch with Yahoo or Google, since none of us has a 4 billion page index up and running. I would run one on my laptop, but I do not have the bandwidth to fetch that much within the next two days. :-D
Great work!

Cheers,
Stefan
