Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Uwe Schindler
Hi, my response was a bit unclear. Before Lucene 4.0 we saved *deletions* in a bitset (1 = doc deleted), so you were able to use the DocIdSetIterator provided directly. At this point there was no sparse implementation. My idea was more about this: "Because we marked *deleted* docs (not live

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Good point, I opened an issue to discuss this: https://github.com/apache/lucene/issues/13084. Did we actually use a sparse bit set to encode deleted docs before? I don't recall that. On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler wrote: > Hi, > > A SparseBitset impl for DELETES would be fine if t

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Uwe Schindler
Hi, A SparseBitset impl for DELETES would be fine if the model in Lucene would encode deleted docs (it did that in earlier times). As deletes are sparse (deletes are in most cases <40%), this would help to make the iterator cheaper. Uwe Am 06.02.2024 um 09:01 schrieb Adrien Grand: Hey Mich

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Hey Michael, You are right, iterating all deletes with nextClearBit() would run in O(maxDoc). I am coming from the other direction, where I'm expecting the number of deletes to be more in the order of 1%-5% of the doc ID space, so a separate int[] would use lots of heap and probably not help that

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-05 Thread Michael Froh
Thanks Adrien! My thinking with a separate iterator was that nextClearBit() is relatively expensive (O(maxDoc) to traverse everything, I think). The solution I was imagining would involve an index-time change to output, say, an int[] of deleted docIDs if the number is sufficiently small (like mayb

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Adrien Grand
Hi Michael, Indeed, only MatchAllDocsQuery knows how to produce a count when there are deletes. Your idea sounds good to me, do you actually need a side car iterator for deletes, or could you use a nextClearBit() operation on the bit set? I don't think we can fold it into Weight#count since ther

Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Michael Froh
Hi, On OpenSearch, we've been taking advantage of the various O(1) Weight#count() implementations to quickly compute various aggregations without needing to iterate over all the matching documents (at least when the top-level query is functionally a match-all at the segment level). Of course, from