Thanks for your comments Chris, and sorry for the delayed response - you raised some tough questions for me, and I felt I had to clear my thoughts before replying. (Well, as you'll see below they are not too clear now either, but I am going to be off-line for the next ~10 days, so I decided not to wait any longer...)
Chris Hostetter wrote:

> for the record: SweetSpotSimilarity was designed to let you
> use the average length as the "sweet spot" (both min and max)
> but as you say: you have to know/guess the average length in
> advance (and explicitly set it using
> SweetSpotSimilarity.setLengthNormFactors(yourAvg,yourAvg,
> steepness))

I didn't try this - passing the computed avg doc length to SweetSpotSimilarity (SSS) - it would be interesting to try. I wonder how it would perform compared to the variation of pivoted (unique) length normalization that I tried. The difference is that SSS punishes docs both above and below the range, while with pivoted normalization docs above the pivot are punished and those below the pivot are boosted. Pivoted normalization makes more sense to me than SSS.

Assuming this proves to be useful in Lucene as a general improvement (largely collection independent), the question is how to compute/store/retrieve this data. The way I experimented with it was focused on flexibility at search time rather than on efficiency: my custom analyzer counted the number of unique tokens in the document, and finally a field holding this number was added to the document. At search time this field was loaded (for all docs), the average was computed (the pivot), and both the pivot and the unique length were used for normalization, with some slop.

One way to do something like this more efficiently would be to store the average (unique) length in the segment, and to update this value when segments are merged (doc deletions need some thought). This works nicely with today's single doc per segment; I am not sure how it would work with the new code in LUCENE-843. This perhaps goes too far too early into implementation details, and I wonder whether a flexible index format could satisfy/represent adding this type of data to the index.

> if i understand your argument, the pure "IR'ish" model makes
> no difference between finding input in different fields, so
> it should use a single document length -- but that assumes
> you are searching for the input in all fields right? if you
> have a title field and a body field but you are only searching
> for "title:word" then why should the length of the body field
> factor into the score at all?
>
> my basic understanding of IRish relevancy judgements of the
> idf/tf model is that "documents" are generic lumps of text ..
> while Lucene has a "Document" class that supports multiple
> Fields, the most direct mapping of IR "documents" and
> Lucene objects would be a Document with only one Field
> ... but that doesn't mean that if a Document has more
> than one field that the length of all fields should be a
> factor in searching for a "field1:word" (any more than the
> tf used should be the frequency of "word" in all fields)

These are good arguments, and I am not 100% sure here. My use case was boosting certain fields - say you have just {title} and {body} and want to make title words worth 3 times more. A natural way to do this is to have two fields, "body" and "title", set their boosts to 1 for "body" and 3 for "title", and then, when one searches the entire document (without specifying a field), create a multi-field query. Things should work fine - boosts are ok, tf() is per field, and so is the norm. But empirically it doesn't work well. When I modified my index to have a single field in which I simply repeated the title 3 times, I got better results.
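(To make the comparison concrete, here is roughly what the two setups looked like - the class and method names are made up for this sketch, and the 3x factor is just the value from my experiment:)

  import java.io.IOException;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class TwoSetups {

    // setup 1: separate fields, "title" boosted at index time,
    // searched later with a multi-field query over "title" and "body"
    static void addWithFieldBoost(IndexWriter w, String title, String body)
        throws IOException {
      Document doc = new Document();
      Field t = new Field("title", title, Field.Store.NO, Field.Index.TOKENIZED);
      t.setBoost(3f); // the 3x factor from my experiment
      doc.add(t);
      doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
      w.addDocument(doc);
    }

    // setup 2: one catch-all field, title text simply repeated 3 times,
    // searched as a plain single-field query - empirically this did better
    static void addWithRepeatedTitle(IndexWriter w, String title, String body)
        throws IOException {
      Document doc = new Document();
      doc.add(new Field("all", title + " " + title + " " + title + " " + body,
          Field.Store.NO, Field.Index.TOKENIZED));
      w.addDocument(doc);
    }
  }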
There are questions here. Is there a pure mathematical explanation for why the sum of scores/statistics over separate fields yielded poorer quality than a single score/statistics over one field containing all the text? Perhaps the loss of information in the norms was more damaging with more (smaller) fields? (I had four, btw.) I don't know. I just saw that a single field is better (when this is possible), and I went on with it.

Now, the new payloads allow specifying boost at the token level. So it is possible (although not yet very easy) to set that for the title words, and not to separate into fields just for boosts. This would remove the original motivation I started with - because in my case there really was no need to separate into fields except for the boosts.

Looking at the general case - a multi-field Lucene document and a user query on only some of those fields. Currently Lucene computes statistics separately for each field. When I first noticed this behavior in Lucene (which was different from what I was used to), I was surprised. Then I thought that this is really the right thing - assume docs with two fields, "friends" and "enemies": what sense is there in considering statistics of "friends" when searching in "enemies"? (I am taking your side here.) This is a weird example, having two fields that negate each other, but it makes the question clearer. So, for the query "Mr. Jones and Mrs. James" in the "friends" field, a document would be considered only for matches in the "friends" field. If both doc1 and doc2 contain "Mr. Jones" in "friends", and the lengths of their "friends" fields are equal, but doc2's "enemies" field is longer, does it make sense to punish doc2? I think that for this example, if anything, doc2 should have been boosted, because it has more enemies, and even so one of its friends is "Mr. Jones". So theoretically I agree with you.

In reality, I don't know if we often get to see examples like that last one. And it would be expensive to maintain the (accurate) information needed for such pivot normalization for each field of each document. But maintaining it just once for the entire document may be possible... Mmm... by adding this to the segment's field-info we could do it per field too... it should probably be optional. (If the above reads uncertain and somewhat confused, it is because currently this is how it is...)

> Note: for the record, i think it would be nice if the Score
> computation "engine" of Lucene allowed for more hooks to let
> people control stuff like this, so that it is possible in use
> cases where it makes sense -- i'm just not convinced it makes
> sense as the general case. (ie: even if it were
> possible, i don't think it would make sense as the default)

Agree.
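(In fact, part of this already has a hook: lengthNorm can be overridden per field. A rough sketch of the pivoted variant I described above, with two caveats: lengthNorm sees the total token count rather than the unique count, and it is applied at indexing time, so the pivot must be known/guessed in advance - the same chicken-and-egg problem as with SSS. The class name and the 0.25 slope are made up for the sketch:)

  import org.apache.lucene.search.DefaultSimilarity;

  public class PivotedLengthNormSimilarity extends DefaultSimilarity {

    private final float pivot; // avg length, must be known/guessed in advance
    private final float slope; // e.g. 0.25f, purely illustrative

    public PivotedLengthNormSimilarity(float pivot, float slope) {
      this.pivot = pivot;
      this.slope = slope;
    }

    // docs longer than the pivot are punished (norm < 1),
    // shorter ones are boosted (norm > 1); note the value gets
    // encoded into a single byte, so precision is limited
    public float lengthNorm(String fieldName, int numTokens) {
      return (float) (1.0 / ((1.0 - slope) + slope * (numTokens / pivot)));
    }
  }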
> : a nice problem is that current Similarity API does not
> : even allow to consider this info (the average term frequency
> : in the specific document) because Similarity.tf(int/float freq)
> : takes only the frequency param. One way to open the way for such
> : computation is to add an "int docid" param to the
> : Similarity class, but then the implementation of that class
> : becomes IndexReader aware.
>
> this strikes me as not being a deficiency in the existing
> Query/Scorer/Similarity APIs as much as perhaps an opportunity for new
> Query types (with new Scorer types) that take advantage of the Payloads
> API (i say this having almost 0 knowledge of what the payloads API looks
> like) to take advantage of more term, doc, and index level payloads.
> Any of this type of logic could be put into new Query subclasses; the trick
> becomes allowing the common math of those new Query classes to be
> refactored out and controlled by classes at runtime -- this is really what
> the current Similarity class does: provide a refactoring of common types
> of calculations in such a way that those calculations can be replaced at
> run time (at least: that's how i view the Similarity class)
>
> i'm just typing out loud here without thinking it through, but one approach
> we could take to allow Similarity to be extended in special use cases
> would be if each new Query's getSimilarity() method tested the similarity
> from its parent to see if it implemented an interface indicating that it
> contained the method(s) this query needed to perform its custom
> calculations ... if it does: hurray! ... if not, then wrap it in the
> "default" version of that marker interface.
>
> ie...
>
>   public class DocRelativeTfTermQuery extends TermQuery {
>     ...
>     public Similarity getSimilarity(Searcher searcher) {
>       Similarity s = super.getSimilarity(searcher);
>       if (s instanceof DocRelativeTfSimilarity) return s;
>       return new AverageTfSimilarityWrapper(s);
>     }
>
>     public static interface DocRelativeTfSimilarity {
>       public float tf(int doc, float freq);
>     }
>
>     public static class AverageTfSimilarityWrapper
>         extends SimilarityDelegator
>         implements DocRelativeTfSimilarity {
>       public AverageTfSimilarityWrapper(Similarity s) { super(s); }
>       public float tf(int doc, float freq) {
>         return tf(freq - averageFreq(doc));
>       }
>       protected float averageFreq(int doc) {
>         ... get from document payload ...
>       }
>     }
>   }
>
> ...so now by default the DocRelativeTfTermQuery computes the score for
> that term based on the average tf of all terms in that field ... but if
> you construct a Similarity of your own that implements
> DocRelativeTfSimilarity you can make it use any document specific
> calculation you want.
>
> The performance of simple queries doesn't change, the api for simple users
> of Similarity doesn't get more complicated, and we can start supporting
> arbitrarily complex scoring calculations in Scorers and still allow the
> "client" to supply the meat of those calculations at run time.

This is nice - I didn't think of it. It would also be nice if, instead of creating new types of queries, the existing ones (Span, Boolean, Phrase, Wild) could somehow be "set to" use DocRelativeTfTermQuery instead of TermQuery? (A rough attempt at this is below, after my signature.)

> -Hoss

Thanks again Hoss for the detailed comments and great ideas.

Doron
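P.S. Regarding "setting" existing queries to use DocRelativeTfTermQuery: for queries that come from the query parser, overriding QueryParser.getFieldQuery might be enough. Just a sketch - it assumes your hypothetical DocRelativeTfTermQuery with a constructor taking a Term, and it only covers the term query leaves that the parser creates; hand-built Boolean/Phrase/Span queries would still need their own treatment:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;

  public class DocRelativeTfQueryParser extends QueryParser {

    public DocRelativeTfQueryParser(String field, Analyzer analyzer) {
      super(field, analyzer);
    }

    // replace each plain TermQuery leaf the parser produces;
    // phrase and boolean structure is left untouched
    protected Query getFieldQuery(String field, String queryText)
        throws ParseException {
      Query q = super.getFieldQuery(field, queryText);
      if (q instanceof TermQuery) {
        // DocRelativeTfTermQuery is the class from your sketch above
        return new DocRelativeTfTermQuery(((TermQuery) q).getTerm());
      }
      return q;
    }
  }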