You have not committed the NutchSimilarity class (at least I cannot find a new version in SVN), so the default length normalization is used for host and title. Is that on purpose or by accident?
It was on purpose, but with uncertainty. Sorry I forgot to mention it.
In general, I think we can't assume that new fields should be length-normalized like old ones. We need to experiment first.
In particular, I did not think that titles should have the same length normalization as anchors. We don't penalize much for long anchors (using 1/log(#tokens)), as a long anchor field also corresponds to many anchors, which is a good thing. But a page always has only one title, and matches on long titles should probably be penalized as they are in ordinary text. So, unless we find a reason to treat them differently, I thought that titles should use the default length normalization (1/sqrt(#tokens)).
For URLs, 1/#tokens was used to boost matches on short URLs, which tend to be host matches, which is what we really want to boost. Now, with the host in a separate field, we have an explicit parameter to boost host matching. So perhaps we could use the default normalization for both of these. On the other hand, we might want to very strongly prefer matches for "foo" on "foo.com" above those on "foo.bar.com", and perhaps on "bar.com/foo" above those on "bar.com/baz/foo". So should we use 1/#tokens on the host field, and 1/sqrt(#tokens) on the entire URL? I don't really know until we've experimented. So I left URL normalization as it was, for back-compatibility, and gave host normalization the default treatment until we learn more.
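To make the current choices concrete, here is a rough sketch of the per-field normalization described above. It is not the committed NutchSimilarity code: the class name LengthNormSketch and the field-name strings are made up for illustration, and adding e inside the logarithm is my own assumption, so that a one-token anchor does not divide by log(1) = 0.

```java
// Hypothetical sketch only; not the actual NutchSimilarity class.
class LengthNormSketch {

    /** Returns a length-normalization factor for a field holding numTokens tokens. */
    static double lengthNorm(String field, int numTokens) {
        int n = Math.max(numTokens, 1); // guard against empty fields
        switch (field) {
            case "url":
                // Linear: strongly prefers matches on short URLs.
                return 1.0 / n;
            case "anchor":
                // Logarithmic: long anchor fields are penalized only mildly.
                // Math.E is added (an assumption) so that n = 1 gives a finite value.
                return 1.0 / Math.log(Math.E + n);
            default:
                // Default treatment, used here for title and host.
                return 1.0 / Math.sqrt(n);
        }
    }

    public static void main(String[] args) {
        // Print the factor for a few field lengths to compare the curves.
        for (String field : new String[] {"url", "anchor", "title", "host"}) {
            for (int n : new int[] {1, 4, 16}) {
                System.out.printf("%-6s n=%-2d norm=%.3f%n", field, n, lengthNorm(field, n));
            }
        }
    }
}
```

With these formulas a 16-token anchor field is penalized less than a 16-token title, and a 16-token URL is penalized most of all. Switching the host field to 1/#tokens would only mean moving it into the "url" case.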
I hope this makes sense. And I'm sorry that I silently changed this in your patch.
Cheers,
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
