Terry Steichen wrote:
> Can you give me an idea of what to replace the lengthNorm() method with to, for example, remove any special weight given to shorter matching documents?
The goal of the default implementation is not to give any special weight to shorter documents, but rather to remove the advantage longer documents have. Longer documents are likely to have more matches simply because they contain more terms. Also, for the query "foo", a document containing just "foo" is a better match than a longer one containing "foo bar baz", since the match is more exact.
However, one problem with this approach can be that very short documents are in fact not very informative. Thus a bias against very short documents is sometimes useful.
> I can certainly go through a bunch of trial-and-error efforts, but it would help if I had some grasp of the logic initially. For example, from DefaultSimilarity, here's the lengthNorm() method:
>
> public float lengthNorm(String fieldName, int numTerms) {
>     return (float)(1.0 / Math.sqrt(numTerms));
> }
>
> Should I (for the purpose of eliminating any size bias) override it to always return a 1?
That's something to try, although, as mentioned above, I suspect your top hits will be dominated by long documents. Try it. It's really not a difficult experiment!
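For a rough feel of what is at stake before you run the experiment, here is a small standalone sketch. It is plain Java with no Lucene dependency; the class and method names are made up for illustration, and only the 1/sqrt(numTerms) formula comes from the DefaultSimilarity method quoted above.

```java
// Illustrative only: mimics the quoted formula outside of Lucene.
class LengthNormComparison {
    // Same arithmetic as the DefaultSimilarity.lengthNorm() quoted above.
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical override that removes length normalization entirely.
    static float constantNorm(int numTerms) {
        return 1.0f;
    }

    public static void main(String[] args) {
        int[] lengths = {1, 4, 100, 10000};
        for (int n : lengths) {
            System.out.println(n + " terms: default=" + defaultNorm(n)
                + " constant=" + constantNorm(n));
        }
    }
}
```

With the default, a 10,000-term field is scaled by 0.01 while a 4-term field is scaled by 0.5. With a constant 1 nothing is scaled, so whatever extra matches the long field picks up count in full.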
One trick I've used to keep very short documents (which may be good matches but are not informative) from dominating results is to override this with something like:
public float lengthNorm(String fieldName, int numTerms) {
    return super.lengthNorm(fieldName, Math.max(numTerms, 100));
}
This way all fields shorter than 100 terms are scored like fields containing 100 terms. Long documents are still normalized, but search is biased a bit against very short documents.
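To see what the clamp does numerically, here is a standalone sketch (again plain Java without Lucene; the class name and the MIN_TERMS constant are only illustrative, and the default formula is assumed to be 1/sqrt(numTerms)):

```java
// Illustrative only: the clamped norm next to the default one.
class ClampedNormSketch {
    // Threshold taken from the override above; the name is hypothetical.
    static final int MIN_TERMS = 100;

    // Assumed default formula: 1 / sqrt(numTerms).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Fields shorter than MIN_TERMS are scored as if they had MIN_TERMS terms.
    static float clampedNorm(int numTerms) {
        return defaultNorm(Math.max(numTerms, MIN_TERMS));
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 50, 100, 400}) {
            System.out.println(n + " terms: default=" + defaultNorm(n)
                + " clamped=" + clampedNorm(n));
        }
    }
}
```

Fields of 3, 50 and 100 terms all end up with the same clamped value, while a 400-term field gets exactly what the default would give it.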
> How would I boost the headline field here? Is that how you are supposed to use the (presently unused) fieldName parameter? If that's the case, I assume I would logically (to do what I'm trying to do) make this factor greater than 1 for the 'headline' field, and 1 for all other fields?
You could do that here too. So, for example, you could do something like:
public float lengthNorm(String fieldName, int numTerms) {
    float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
    if (fieldName.equals("headline"))
        n *= 4.0f;
    return n;
}
Equivalently, you could create your documents with something like:
Document d = new Document();
Field f = Field.Text("headline", headline);
f.setBoost(4.0f);
...
But headlines tend to be short, and naturally benefit from the default lengthNorm implementation. So what you really might want is something like:
public float lengthNorm(String fieldName, int numTerms) {
    if (fieldName.equals("headline"))
        return 4.0f * super.lengthNorm(fieldName, numTerms);
    else
        return super.lengthNorm(fieldName, Math.max(numTerms, 100));
}
This is probably what I'd try first.
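If you want to check the numbers before wiring this into a Similarity subclass, a standalone sketch along these lines may help. It is plain Java without Lucene, the class name and the "body" field name are invented for the example, and the default formula is assumed to be 1/sqrt(numTerms).

```java
// Illustrative only: per-field norm with a headline boost and a clamp elsewhere.
class FieldAwareNormSketch {
    // Assumed default formula: 1 / sqrt(numTerms).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Headlines keep the default short-field advantage and get a 4x boost;
    // every other field is treated as at least 100 terms long.
    static float lengthNorm(String fieldName, int numTerms) {
        if ("headline".equals(fieldName)) {
            return 4.0f * defaultNorm(numTerms);
        }
        return defaultNorm(Math.max(numTerms, 100));
    }

    public static void main(String[] args) {
        System.out.println("headline, 4 terms: " + lengthNorm("headline", 4));
        System.out.println("body, 4 terms:     " + lengthNorm("body", 4));
        System.out.println("body, 400 terms:   " + lengthNorm("body", 400));
    }
}
```

A 4-term headline comes out at 4 * 0.5 = 2.0, whereas a 4-term field of any other name is held to the 100-term value.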
Doug
