Chris Hostetter wrote:
> i guess i'm not following how exactly your pivoted norm calculation works
> ... it sounds like you are still rewarding 1 term long fields more than

True.

> any other length ... is the distinction between your approach and the
> default implementation just that the default is a smooth curve,
> while yours
> is two different curves -- one below the pivot (average length) and one
> above it? ... which functions do you use?

Basically it is

    (1 - Slope) * Pivot + (Slope) * Doclen

where Pivot reflects the average doc length, and a smaller Slope
reduces the amount by which short docs are preferred over long ones.
In a collection with very long documents, a doc shorter than the pivot
would be rewarded, but that same doc would be rewarded relatively less
in a collection with shorter docs. So how much you reward adapts to the
specific collection characteristics, without knowing these
characteristics in advance.

> : question is how to compute/store/retrieve this data.
> : The way I experimented with it was not focused on efficiency
> : but rather on flexibility at search time, my custom analyzer
> : counted the number of unique tokens in the document, and finally
> : a field was added to the document with this number. At search
> : time this field was loaded (for all docs), the average was
>
> One option to avoid that extra work at index building time would be to
> use logic like what's in LengthNormModifier to build a cache when the
> IndexReader is opened containing the number of terms (either unique or
> total depending on whether you use +=freq or ++) in each doc per field.
>
> it's really no different than a FieldCache -- except that the
> FieldCache.getCustom API doesn't really give you the means to compute
> arbitrary values, but the principle is the same.

I think neither approach is good enough for large dynamic collections.
Both are fine for experiments, but a working, large, dynamic system
needs something more efficient.
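
A minimal sketch of the formula given earlier in this message, in plain Java with no Lucene dependency. The class and method names are hypothetical, the whitespace tokenizer is only a stand-in for the custom analyzer's unique-token count, and turning the pivoted length into a norm by taking its reciprocal is an assumption, not something stated in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;

// Hypothetical helper, not part of Lucene.
class PivotedNormSketch {

    // Number of distinct whitespace-separated tokens in the text
    // (a simplified stand-in for the analyzer-side unique-token count).
    static int uniqueTokenCount(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+"))).size();
    }

    // Pivot: the average document length over the collection.
    static double pivot(int[] docLengths) {
        return Arrays.stream(docLengths).average().orElse(0.0);
    }

    // (1 - Slope) * Pivot + Slope * Doclen, as written in the message.
    static double pivotedLength(double slope, double pivot, double docLen) {
        return (1.0 - slope) * pivot + slope * docLen;
    }

    // ASSUMPTION: the norm is the reciprocal of the pivoted length.
    // Callers must make sure the pivoted length is non-zero.
    static double norm(double slope, double pivot, double docLen) {
        return 1.0 / pivotedLength(slope, pivot, docLen);
    }

    public static void main(String[] args) {
        int[] docLengths = {2, 4, 12};   // made-up unique-token counts
        double p = pivot(docLengths);
        double slope = 0.25;             // made-up value for illustration
        for (int len : docLengths) {
            System.out.println("doclen=" + len
                    + " pivotedLength=" + pivotedLength(slope, p, len)
                    + " norm=" + norm(slope, p, len));
        }
    }
}
```

With Slope = 1 the pivoted length equals Doclen, so short docs get the full preference. With Slope = 0 every doc gets the same value (the pivot), so length has no effect. Values in between interpolate, and because the pivot is the collection average, the same doc length is treated differently in collections with different averages.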

> : natural way to do this is to have two fields "body" and
> : "title", set their boosts 1 for "body" and 3 for "title",
> : and then, when one searches the entire document (without
> : specifying a field), create a multi field query. Things should
> : work fine, - boosts are ok, tf() is by field, so is norm.
> : But empirically it doesn't work well. When I modified
>
> were the boosts you are referring to index time boosts or query time
> boosts? if they were index time (and you applied them to every document
> since in theory the title of every document is worth 3 times as much as
> the body of that document) then i think your index time boosts wound
> up being a complete wash.

No, they were query time boosts.