: For the first change, logic is that Lucene's default length normalization
: punishes long documents too much. I found contrib's sweet-spot-similarity
: helpful here, but not enough. I found that a better doc-length
: normalization method is one that considers collection statistics - e.g.
: average doc length. The nice problem with such an approach is that you

for the record: SweetSpotSimilarity was designed to let you use the average
length as the "sweet spot" (both min and max) but as you say: you have to
know/guess the average length in advance (and explicitly set it using
SweetSpotSimilarity.setLengthNormFactors(yourAvg,yourAvg,steepness))
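for concreteness, a minimal sketch of that setup -- the average length of
250 terms is a made-up value you would have to measure or guess for your
own collection first:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.misc.SweetSpotSimilarity;
    import org.apache.lucene.search.IndexSearcher;

    public class AvgLengthSetup {
      // suppose we measured (or guessed) an average field length of 250 terms
      public static void configure(IndexWriter writer, IndexSearcher searcher) {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // min == max == the average, so the "sweet spot" sits exactly there
        sim.setLengthNormFactors(250, 250, 0.5f);
        // lengthNorm is baked into the norms at index time, so the writer
        // needs the Similarity too, not just the searcher
        writer.setSimilarity(sim);
        searcher.setSimilarity(sim);
      }
    }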
: able to boost the title by (say) 3, but in fact, there is no "IR'ish"
: difference between finding the searched text in the title field or in the
: body field - they really serve/answer the same information need. For that
: matter, I believe that using a single document length when searching all
: these fields is more "accurate".

if i understand your argument, the pure "IR'ish" model makes no difference
between finding the input in different fields, so it should use a single
document length -- but that assumes you are searching for the input in all
fields, right?  if you have a title field and a body field but you are only
searching for "title:word" then why should the length of the body field
factor into the score at all?

my basic understanding of the IR'ish relevancy judgements of the idf/tf
model is that "documents" are generic lumps of text ... while Lucene has a
"Document" class that supports multiple Fields, the most direct mapping of
IR "documents" to Lucene objects would be a Document with only one Field
... but that doesn't mean that if a Document has more than one field, the
length of all fields should be a factor when searching for "field1:word"
(any more than the tf used should be the frequency of "word" across all
fields)

Note: for the record, i think it would be nice if the score computation
"engine" of Lucene allowed for more hooks to let people control stuff like
this, so that it is possible in use cases where it makes sense -- i'm just
not convinced it makes sense as the general case.  (ie: even if it were
possible, i don't think it would make sense as the default)

: that doc. If you agree about the potential improvement here, again, a nice
: problem is that current Similarity API does not even allow to consider this
: info (the average term frequency in the specific document) because
: Similarity.tf(int/float freq) takes only the frequency param. One way to
: open way for such computation is to add an "int docid" param to the
: Similarity class, but then the implementation of that class becomes
: IndexReader aware.

this strikes me as not being a deficiency in the existing
Query/Scorer/Similarity APIs as much as an opportunity for new Query types
(with new Scorer types) that take advantage of the Payloads API (i say this
having almost 0 knowledge of what the payloads API looks like) to take
advantage of more term, doc, and index level payloads.

any of this type of logic could be put into new Query subclasses; the trick
becomes allowing the common math of those new Query classes to be
refactored out and controlled by classes at runtime -- this is really what
the current Similarity class does: provide a refactoring of common types of
calculations in such a way that those calculations can be replaced at run
time (at least: that's how i view the Similarity class)
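as a toy illustration of that view of Similarity -- the class name here is
invented, and only the tf curve changes, everything else is inherited:

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Searcher;

    // replaces one piece of the scoring math at run time,
    // without touching any Query or Scorer code
    public class FlatTfSimilarity extends DefaultSimilarity {
      public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;  // count only presence, not frequency
      }
      public static void install(Searcher searcher) {
        searcher.setSimilarity(new FlatTfSimilarity());
      }
    }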
i'm just typing out loud here without thinking it through, but one approach
we could take to allow Similarity to be extended in special use cases would
be if each new Query's getSimilarity() method tested the similarity from
its parent to see if it implemented an interface indicating that it
contained the method(s) this query needed to perform its custom
calculations ... if it does: hurray! ... if not, then wrap it in the
"default" version of that marker interface.  ie...

    public class DocRelativeTfTermQuery extends TermQuery {
      ...
      public Similarity getSimilarity(Searcher searcher) {
        Similarity s = super.getSimilarity(searcher);
        if (s instanceof DocRelativeTfSimilarity) return s;
        return new AverageTfSimilarityWrapper(s);
      }

      public static interface DocRelativeTfSimilarity {
        public float tf(int doc, float freq);
      }

      public static class AverageTfSimilarityWrapper extends SimilarityDelegator
          implements DocRelativeTfSimilarity {
        public AverageTfSimilarityWrapper(Similarity s) { super(s); }
        public float tf(int doc, float freq) {
          // shift the raw freq by this doc's average term freq, then
          // hand it to the wrapped Similarity's plain tf()
          return tf(freq - averageFreq(doc));
        }
        protected float averageFreq(int doc) { /* ...get from document payload... */ }
      }
    }

...so now by default the DocRelativeTfTermQuery computes the score for that
term based on the average tf of all terms in that field ... but if you
construct a Similarity of your own that implements DocRelativeTfSimilarity
you can make it use any document specific calculation you want.  The
performance of simple queries doesn't change, the api for simple users of
Similarity doesn't get more complicated, and we can start supporting
arbitrarily complex scoring calculations in Scorers and still allow the
"client" to supply the meat of those calculations at run time.

-Hoss
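a usage sketch of that pattern -- MyDocRelativeTfSimilarity and the
Term-based constructor are invented for illustration, everything else
follows the names in the sketch above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // a client-supplied Similarity that already implements the marker
    // interface, so DocRelativeTfTermQuery keeps it instead of wrapping it
    public class MyDocRelativeTfSimilarity extends DefaultSimilarity
        implements DocRelativeTfTermQuery.DocRelativeTfSimilarity {
      public float tf(int doc, float freq) {
        // plug in whatever per-document calculation you want;
        // a dampened curve here, just as a placeholder
        return (float) Math.log(1.0 + freq);
      }

      public static void demo(Searcher searcher) {
        searcher.setSimilarity(new MyDocRelativeTfSimilarity());
        // assumes a TermQuery-style constructor on the new query class
        Query q = new DocRelativeTfTermQuery(new Term("body", "word"));
      }
    }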