On Sun, Mar 7, 2010 at 11:43 AM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
>> > Fortunately, beaming field length data around is an easier problem than
>> > distributed IDF, because with rare exceptions, the number of fields in a
>> > typical index is minuscule compared to the number of terms.
>>
>> Right... so how do we control/configure when stats are fully
>> recomputed corpus wide.... hmmm.  Should be fully app controllable.
>
> Hmm, at first, I don't like the sound of that.  Right now, we're talking about
> an esoteric need for a specific plugin, BM25 similarity.  The top level
> indexer object should be oblivious to the implementation details of plugins.

It's not only for BM25.  EG lnu.ltc Sim impl would tap into the avg(tf).
Sweet spot sim (if it were to set its plateau by corpus wide avg) would too.

> However, the theme here is the need for an individual node to sync up with the
> distributed corpus.  If you don't do that at index time, you have to do it at
> search time, which isn't always ideal.  So I can see us building in some sort
> of functionality to address that more general case.  It would be the flip of
> the MultiSearcher-comprised-of-remote-searchables situation.

And the NRT case is "both" indexing & searching and I expect in there
the app should have clear control.

For a large index the stats will be stable after re-indexing only a
few more docs.

>> > I guess you'd want to accumulate that average while building the segment...
>> > oh wait, ugh, deletions are going to make that really messy.  :(
>> >
>> > Think about it for a sec, and see if you swing back to the desirability of
>> > calculation on the fly using maxDoc(), like I just did.
>>
>> I think we'd store a float (holding avg(tf) that you computed when
>> inverting that doc, ie, for all unique terms in the doc what's the avg
>> of their freqs) for every doc, in the index.  Then we can regen fully
>> when needed right?
>
> Hmm, full regeneration would be expensive, so I'd discounted it.  You'd have
> to iterate the entire posting list for every term, adding up freq() while
> skipping deleted docs.

No... the stat is avg tf within the doc.

So if I index this doc:

  a a a a b b b c c d

The avg(tf) = average(4 3 2 1) = 2.5.

So we'd store 2.5 for that docXfield in a fixed-width dense postings
list (like column stride fields -- every doc has a value).

So the facts stored in the index here are basically the same as token
length per docXfield (well, a float instead of an int, but).

>> Or maybe we store sum(tf) and #unique terms... hmm.
>>
>> Handling docs that did not have the field is a good point...
>> but we can assign a special value (eg 0.0, or, any negative number say) to
>> encode that?
>
> Where?
>
> In the full field storage?  Too slow to recover.
>
> In the term dictionary?  The term dictionary can't store nulls.  You'd have to
> use sentinels... thus restricting the allowable content of the field?!  No
> way.
>
> In the Lucy-style mmap'd sort cache?  That would work, because we always have
> a "null ord", to which documents which did not supply a value for the field
> get assigned in the ords array.  However, sort/field caches are orthogonal to
> this problem and we don't want to require them for an ancillary need.
>
> I suppose you could do it by iterating all posting lists for a field and
> flipping bits in a bit vector.  The bits that are left unset correspond to
> docs with null values.

No, in a new dedicated dense postings list that feels like a column
stride field.  Sentinel is perfectly fine here since these stats are
naturally only positive.  Each stat is free to pick its own
sentinel... it's a private matter (how it encodes/decodes on disk).

>> Deletions I think across the board will skew stats until they are
>> reclaimed.
>
> Yes, and unless the stats are fully regenerated when a segment with deletions
> gets merged away, the averages will be wrong to some degree, with the skew
> potentially worsening over time.
>
> Say that you have a segment with an average field length of 5 for the "tags"
> field, but that that average is the result of most docs having 1 tag, while a
> handful of docs have 100 tags.  Now say you delete all of the docs with 100
> tags.  The recorded average for the "tags" field within the segment is now all
> messed up -- it should be "1", but it's "5".  You have to regenerate a new,
> correct average when building a new segment.  You can't use the existing value
> of "5" as a shortcut, or the consolidated segment's averages will be wrong
> from the get-go.
>
> That's what I was getting at earlier.
> However, I'd thought that we could get
> around the problem by fudging with maxDoc(), and I no longer believe that.  I
> think full regeneration is the only way.

Actually I was wrong -- deletions would be properly taken into account
if you ask for "accurate avg" to be regen'd on reopening your
reader/searcher.  Ie, we store the stats per doc, so it's a single
pass through the postings to compute the avg -- O(maxDoc).  We would
just skip the deleted docs in this pass...

So on reopen of reader/searcher user would have to ask for full regen
of the stats (to get accurate avg over full corpus) or not (which'd
re-use current corpus avg but would compute norms for the new
segment(s)).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
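
For readers following the thread, here is a minimal Python sketch of the scheme discussed in this message: one avg(tf) value per doc in a dense column, a negative sentinel for docs that lack the field, and a single O(maxDoc) pass that skips deleted docs when the corpus-wide average is regenerated. The names `MISSING`, `avg_tf` and `corpus_avg` are hypothetical and are not Lucene or Lucy APIs. A plain list stands in for the on-disk column and a set of doc ids stands in for the deletions.

```python
from collections import Counter

# Hypothetical sentinel for "doc has no value for this field".
# Real avg(tf) values are always positive, so a negative number is safe.
MISSING = -1.0


def avg_tf(tokens):
    """Average term frequency over the unique terms of one doc's field."""
    if not tokens:
        return MISSING
    freqs = Counter(tokens)
    return sum(freqs.values()) / len(freqs)


def corpus_avg(column, deleted):
    """Single pass over the dense per-doc column: O(maxDoc).

    Deleted docs and sentinel slots are skipped, so the result reflects
    only live docs that actually have the field.
    """
    total = 0.0
    count = 0
    for doc_id, value in enumerate(column):
        if doc_id in deleted or value == MISSING:
            continue
        total += value
        count += 1
    return total / count if count else 0.0


# Doc 0 is the example from the message, doc 1 lacks the field.
column = [
    avg_tf("a a a a b b b c c d".split()),
    avg_tf([]),
    avg_tf("x x y".split()),
]

print(column[0])                       # avg(tf) for the example doc
print(corpus_avg(column, set()))       # average over all live docs
print(corpus_avg(column, {2}))         # same pass after deleting doc 2
```

Because the pass reads one fixed-width value per doc rather than every posting list, the cost depends on maxDoc and not on the number of terms, which is the point made above about full regeneration on reopen.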