On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:

> The problem is, these scoring models need the avg field length (in
> tokens) across the entire index, to compute the norms.
>
> Ie, you can't do that on writing a single segment.
I don't see why not.  We can just move everything you're doing on
Searcher open to index time, and calculate the stats and norms before
writing the segment out.  At search time, the only segment with valid
norms would be the last one, so we'd make sure the Searcher used those.

I think the fact that Lucy always writes one segment per indexing
session -- as opposed to Lucene's one segment per document -- makes a
difference here.  Whether burning norms to disk at index time is the
most efficient setup depends on the ratio of commits to searcher-opens.

In a multi-node search cluster, pre-calculating norms at index time
wouldn't work well without additional communication between nodes to
gather corpus-wide stats.  But I suspect the same trick that works for
IDF in large corpuses would work for average field length: it will tend
to be stable over time, so you can update it infrequently.

> So I think it must be done during searcher init.
>
> The most we can do is store the aggregates (eg sum of all lengths in
> this segment) in the SegmentInfo -- this saves one pass on searcher
> init.

Logically...

    token_counts: {
        segment: {
            title:   4,
            content: 154,
        },
        all: {
            title:   98342,
            content: 2854213
        }
    }

(Would that suffice?  I don't recall the gory details of BM25.)

As documents get deleted, the stats will gradually drift out of sync,
just like doc freq does.  However, that's mitigated if you recycle
segments that exceed a threshold deletion percentage on a regular
basis.

> The norms array will be stored in this per-field sim instance.

Interesting, but that wasn't where I was thinking of putting them.
Similarity objects need to be sent over the network, don't they?  At
least they do in KS.  So I think we need a local per-field
PostingsReader object to hold such cached data.

> > The insane loose typing of fields in Lucene is going to make it a
> > little tricky to implement, though.  I think you just have to
> > exclude fields assigned to specific similarity implementations from
> > your merge-anything-to-the-lowest-common-denominator policy and
> > throw exceptions when there are conflicts rather than attempt to
> > resolve them.
>
> Our disposition on conflict (throw exception vs silently coerce)
> should just match what we do today, which is to always silently
> coerce.

What do you do when you have to reconcile two posting codecs like this?

  * doc id, freq, position, part-of-speech identifier
  * doc id, boost

Do you silently drop all information except doc id?

> > Similarity is where we decode norms right now.  In my opinion, it
> > should be the Similarity object from which we specify per-field
> > posting formats.
>
> I agree.

Great, I'm glad we're on the same page about that.

> > Similarity implementation and posting format are so closely related
> > that in my opinion, it makes sense to tie them.
>
> This confuses me -- what is stored in these stats (each field's token
> length, each field's avg tf, whatever else a codec wants to add over
> time...) should be decoupled from the low level format used to store
> it?

I don't know about that.  I don't think it's necessary to decouple
them.  There might be some minor code duplication, but similarity
implementations don't tend to be very large, so the DRY violation
doesn't bother me.

What's going to be a little tricky is that you can't have just one
Similarity.makePostingDecoder() method.  Sometimes you'll want a
match-only decoder.  Sometimes you'll want positions.  Sometimes you'll
want part-of-speech ids.  It's more of an interface/roles situation
than a subclass situation.

> > If you're looking for small steps, my suggestion would be to focus
> > on per-field Similarity support.
>
> Well, that alone isn't sufficient -- the index needs to record/provide
> the raw stats, and doc boosting (norms array) needs to be done using
> these stats.

Not sufficient, but it's probably a prerequisite.
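To make the interface/roles idea concrete, here's a rough sketch in Java.  All of these names are hypothetical -- this is not real Lucene or Lucy API -- but the shape is: each capability is a role interface, a codec's decoder opts into whichever roles its on-disk format can actually serve, and callers ask for the narrowest role they need instead of downcasting to a subclass.

```java
// Hypothetical sketch of role-based posting decoders.  None of these
// names exist in Lucene or Lucy; they only illustrate the idea that
// capabilities compose as interfaces rather than as a subclass tree.

interface MatchDecoder {
    boolean next();   // advance to the next posting
    int docID();      // doc id of the current posting
}

interface ScoreDecoder extends MatchDecoder {
    int freq();       // term frequency, for scoring
}

interface PositionDecoder extends ScoreDecoder {
    int nextPosition();  // positions, for phrase queries etc.
}

// A match-only postings format can satisfy MatchDecoder and nothing
// more; a richer format would implement the wider interfaces too.
class MatchOnlyDecoder implements MatchDecoder {
    private final int[] docs;
    private int i = -1;

    MatchOnlyDecoder(int[] docs) { this.docs = docs; }

    public boolean next() { return ++i < docs.length; }
    public int docID()    { return docs[i]; }
}

public class DecoderRoles {
    public static void main(String[] args) {
        MatchDecoder d = new MatchOnlyDecoder(new int[] {3, 7, 42});
        StringBuilder sb = new StringBuilder();
        while (d.next()) sb.append(d.docID()).append(' ');
        System.out.println(sb.toString().trim());
    }
}
```

The point is that reconciling two codecs then becomes "what is the intersection of the roles both can serve?" rather than forcing everything down to a lowest-common-denominator class.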
Since it's a common feature request anyway, I think it's a great place
to start:

    http://lucene.markmail.org/message/ln2xkesici6aksbi
    http://lucene.markmail.org/thread/46vxibpubogtcy3g
    http://lucene.markmail.org/message/56bk6wrbwallyjvr
    https://issues.apache.org/jira/browse/LUCENE-2236

Marvin Humphrey
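P.S.  A rough sketch of how the token_counts aggregates might roll up into the corpus-wide average field length that BM25's length normalization wants.  The class and method names are hypothetical; the only non-invented part is the standard BM25 length-norm term, 1 - b + b * (docLen / avgLen), which divides the tf component of the score.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each segment stores per-field aggregates
// (total token count, doc count); summing them at searcher open --
// or at index time, per the discussion above -- yields the
// corpus-wide average field length that BM25 needs.
public class FieldStats {
    static class Agg { long tokens; long docs; }

    private final Map<String, Agg> byField = new HashMap<>();

    // Called once per segment with that segment's stored aggregates.
    void addSegment(String field, long tokenCount, long docCount) {
        Agg a = byField.computeIfAbsent(field, f -> new Agg());
        a.tokens += tokenCount;
        a.docs   += docCount;
    }

    double avgFieldLength(String field) {
        Agg a = byField.get(field);
        return (double) a.tokens / a.docs;
    }

    // Standard BM25 length-normalization component for one document.
    static double lengthNorm(double b, long docLen, double avgLen) {
        return 1.0 - b + b * (docLen / avgLen);
    }

    public static void main(String[] args) {
        FieldStats stats = new FieldStats();
        stats.addSegment("content", 154, 2);         // newest segment
        stats.addSegment("content", 2854213, 40000); // older segments
        double avg = stats.avgFieldLength("content");
        System.out.println(FieldStats.lengthNorm(0.75, 100, avg));
    }
}
```

A doc exactly at the average length gets a norm of 1.0, which is why the whole scheme hinges on having a reasonably current corpus-wide average.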