Just to clarify, Approach 1 and Approach 2 are both currently performing OK for us. -John
On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>
> >> What does Bobo use the cached bitsets for?
> >
> > Bobo is a faceting engine that uses custom field caches and sometimes
> > cached bitsets, rather than relying exclusively on bitsets to calculate
> > facets. It is useful where many facets (50+) need to be calculated and
> > bitset caching, loading, and intersection would be too costly. Instead
> > it iterates over in-memory custom field caches while hit collecting.
> > Because we're also doing realtime search, making the loading more
> > efficient via in-memory field cache merging is interesting.
>
> OK.
>
> Does it operate at the segment level? Seems like that'd give you
> good-enough realtime performance (though merging in RAM will definitely
> be faster).
>
> > True, we do the in-memory merging with deleted docs; norms would be
> > good as well.
>
> Yes, and eventually column stride fields.
>
> > As a first step, how should we expose the segments a segment has
> > originated from?
>
> I'm not sure; it's quite messy. Each segment must track what other
> segment it got merged to, and must hold a copy of its deletes as of
> the time it was merged. And each segment must know what other
> segments it got merged with.
>
> Is this really a serious problem in your realtime search? Eg, from
> John's numbers in using payloads to read in the docID -> UID mapping,
> it seems like you could make a Query that, when given a reader, would
> go and do "Approach 2" to perform the deletes (if indeed you need to
> delete thousands of docs with each update). What sort of docs/sec
> rates do you need to handle?
>
> > I would like to get this implemented for 2.9 as a building block
> > that perhaps we can write other things on.
>
> I think that's optimistic. It's still at the
> hairy-can't-see-a-clean-way-to-do-it phase.
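The "Approach 2" idea above, using a per-segment docID -> UID mapping loaded from payloads to locate docs whose UIDs were superseded by an update, can be sketched in plain Java. This is a hypothetical illustration only: the Lucene reader and payload plumbing are abstracted into an int[] array, and the names (UidDeleteSketch, docsToDelete) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "Approach 2": given a per-segment
// docID -> UID array (as John loads from payloads), collect the
// docIDs whose UIDs were replaced, so they can be deleted in one pass.
public class UidDeleteSketch {
    // uidByDoc[docID] = application-level unique ID stored in a payload.
    static List<Integer> docsToDelete(int[] uidByDoc, Set<Integer> updatedUids) {
        List<Integer> out = new ArrayList<>();
        for (int docID = 0; docID < uidByDoc.length; docID++) {
            if (updatedUids.contains(uidByDoc[docID])) {
                out.add(docID);  // this doc's UID was superseded by an update
            }
        }
        return out;
    }
}
```

Whether this single scan beats per-UID term lookups depends on how many UIDs change per update, which is why the docs/sec question above matters.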
> Plus I'd like to understand that all other options have been exhausted
> first.
>
> Especially once we have column stride fields and they are merged in
> RAM, you'll be handed a reader pre-warmed, and you can then jump
> through those arrays to find docs to delete.
>
> > Column stride fields still require some encoding, and merging field
> > caches in RAM would presumably be faster?
>
> Yes, potentially much faster. There's no sense in writing through to
> disk until commit is called.
>
> >> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2
> >> (where each "generation" is a renumbering event).
> >
> > Couldn't each SegmentReader keep a docMap and the names of the
> > segments it originated from? However, the name is not enough of a
> > unique key, as there are the deleted docs that change. It seems like
> > we need a unique ID for each segment reader, where the ID is assigned
> > to cloned readers (which normally have the same segment name as the
> > original SR). The ID could be a stamp (perhaps only given to
> > read-only readers?). That way the SegmentReader.getMergedFrom method
> > does not need to return the actual readers, but a docMap and the
> > parent readers' IDs? It would be assumed the user would be holding
> > the readers somewhere? Perhaps all this can be achieved with a
> > callback in IW, and all this logic could be kept somewhat internal
> > to Lucene?
>
> The docMap is a costly way to store it, since it consumes 32 bits per
> doc (vs storing a copy of the deleted docs).
>
> But then docMap gives you random access on the map.
>
> What if, prior to merging or committing merged deletes, there were a
> callback to force the app to materialize any privately buffered
> deletes? And then the app is not allowed to use those readers for
> further deletes? Still kinda messy.
>
> I think I need to understand better why delete-by-Query isn't viable
> in your situation...
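The renumbering idea quoted above (gen X to X+1, then X+1 to X+2) amounts to composing per-generation docMaps. A minimal, hypothetical sketch follows, with deletion represented as -1; it also makes concrete the 32-bits-per-doc cost Mike mentions, since each generation holds one int per doc. The class and method names are invented for the example, not Lucene API.

```java
// Hypothetical sketch: each merge "generation" produces a docMap where
// map[oldDocID] = newDocID, or -1 if the doc was deleted by that merge.
// Composing successive maps renumbers a doc across generations.
public class DocMapSketch {
    static int remap(int docID, int[]... generations) {
        for (int[] map : generations) {
            if (docID == -1) return -1;  // deleted in an earlier generation
            docID = map[docID];
        }
        return docID;
    }
}
```

A bitset copy of the deletes is cheaper to store (one bit per doc) but cannot answer "where did doc N end up?" without a scan, which is the random-access trade-off noted above.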
>
> Mike