Hi,
my response was a bit unclear. Before Lucene 4.0 we saved *deletions* in
a bitset (1 = doc deleted), so you could use the provided DocIdSetIterator
directly. At that point there was no sparse implementation.
My idea was more about this: "Because we marked *deleted* docs (not live
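For illustration, a minimal sketch of that pre-4.0 model, using java.util.BitSet as a stand-in for Lucene's internal bit sets (class and method names here are made up, not Lucene APIs): a set bit means "doc deleted", so walking the deletions is just walking the set bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DeletedDocsIteration {
    // Collect deleted doc IDs by walking the set bits of a deletions bit set,
    // where (as described above) bit = 1 means the doc is deleted.
    static List<Integer> deletedDocs(BitSet deleted) {
        List<Integer> out = new ArrayList<>();
        for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet(16);
        deleted.set(3);
        deleted.set(7);
        System.out.println(deletedDocs(deleted)); // prints [3, 7]
    }
}
```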
Good point, I opened an issue to discuss this:
https://github.com/apache/lucene/issues/13084.
Did we actually use a sparse bit set to encode deleted docs before? I don't
recall that.
On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler wrote:
Hi,
A SparseBitset impl for DELETES would be fine if Lucene's model encoded
deleted docs (it did in earlier times). As deletes are sparse (in most
cases <40%), this would help make the iterator cheaper.
Uwe
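To illustrate why a sparse encoding makes the iterator cheaper, here is a toy two-level sketch (this is NOT Lucene's SparseFixedBitSet, just an illustration with invented names): an index word marks which 64-doc blocks contain any deletion, so iteration skips runs of docs with no deletes entirely.

```java
import java.util.ArrayList;
import java.util.List;

public class ToySparseBits {
    // One "block" = 64 docs. Bit b of blockIndex is set iff block b has any
    // deletion, so this toy supports up to 64 * 64 = 4096 docs.
    private final long blockIndex;
    private final long[] blocks;

    ToySparseBits(int[] deletedDocs) {
        long idx = 0;
        long[] b = new long[64];
        for (int doc : deletedDocs) {
            int block = doc >>> 6;
            idx |= 1L << block;
            b[block] |= 1L << (doc & 63);
        }
        this.blockIndex = idx;
        this.blocks = b;
    }

    // Iterate deleted docs, using the index word to skip empty blocks.
    List<Integer> deletedDocs() {
        List<Integer> out = new ArrayList<>();
        long idx = blockIndex;
        while (idx != 0) {
            int block = Long.numberOfTrailingZeros(idx);
            long word = blocks[block];
            while (word != 0) {
                out.add((block << 6) + Long.numberOfTrailingZeros(word));
                word &= word - 1; // clear lowest set bit
            }
            idx &= idx - 1;
        }
        return out;
    }

    public static void main(String[] args) {
        ToySparseBits bits = new ToySparseBits(new int[] {3, 200, 201, 4000});
        System.out.println(bits.deletedDocs()); // prints [3, 200, 201, 4000]
    }
}
```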
On 06.02.2024 at 09:01, Adrien Grand wrote:
Hey Michael,
You are right, iterating all deletes with nextClearBit() would run in
O(maxDoc). I am coming from the other direction, where I'm expecting the
number of deletes to be more on the order of 1%-5% of the doc ID space, so
a separate int[] would use lots of heap and probably not help that
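Back-of-the-envelope heap math for this trade-off (numbers hypothetical, assuming 4-byte doc IDs): a bit set costs maxDoc/8 bytes regardless of the deletion count, while an int[] of deleted doc IDs costs 4 bytes per deletion, so the int[] only uses less heap below maxDoc/32 deletions, i.e. roughly 3% of the doc ID space.

```java
public class DeleteEncodingSize {
    static long bitSetBytes(long maxDoc) {
        return maxDoc / 8; // one bit per doc, deletes or not
    }

    static long intArrayBytes(long numDeletes) {
        return 4 * numDeletes; // one 4-byte doc ID per deletion
    }

    public static void main(String[] args) {
        long maxDoc = 100_000_000L;
        System.out.println(bitSetBytes(maxDoc));         // 12500000
        System.out.println(intArrayBytes(maxDoc / 20));  // 5% deletes: 20000000
        System.out.println(intArrayBytes(maxDoc / 100)); // 1% deletes: 4000000
    }
}
```

So at 5% deletes the int[] is already bigger than the full bit set, while at 1% it is about a third of its size.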
Thanks Adrien!
My thinking with a separate iterator was that nextClearBit() is relatively
expensive (O(maxDoc) to traverse everything, I think). The solution I was
imagining would involve an index-time change to output, say, an int[] of
deleted docIDs if the number is sufficiently small (like
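A sketch of what that could look like (purely hypothetical: the class name, threshold, and use of java.util.BitSet in place of Lucene's live-docs structures are all made up for illustration): materialize the deletions as a sorted int[] only when there are few of them, otherwise keep the bit set.

```java
import java.util.Arrays;
import java.util.BitSet;

public class SideCarDeletes {
    // Hypothetical cutoff; a real threshold would need tuning.
    static final int MAX_SPARSE_DELETES = 1000;

    /** Returns deleted doc IDs as a sorted array, or null if there are too many. */
    static int[] deletedDocs(BitSet liveDocs, int maxDoc) {
        int deleted = maxDoc - liveDocs.cardinality();
        if (deleted > MAX_SPARSE_DELETES) {
            return null; // fall back to the dense bit set representation
        }
        int[] out = new int[deleted];
        int i = 0;
        // One O(maxDoc) pass at encoding time, so queries don't pay it repeatedly.
        for (int doc = liveDocs.nextClearBit(0); doc < maxDoc; doc = liveDocs.nextClearBit(doc + 1)) {
            out[i++] = doc;
        }
        return out;
    }

    public static void main(String[] args) {
        int maxDoc = 16;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // all docs live...
        live.clear(5);       // ...then delete docs 5 and 9
        live.clear(9);
        System.out.println(Arrays.toString(deletedDocs(live, maxDoc))); // prints [5, 9]
    }
}
```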
Hi Michael,
Indeed, only MatchAllDocsQuery knows how to produce a count when there are
deletes.
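A small sketch of why that count is possible without iterating docs (again using java.util.BitSet as a stand-in, not Lucene's API): with live docs tracked in a bit set, the match-all count under deletions is just the number of live docs, which a cardinality/popcount gives directly.

```java
import java.util.BitSet;

public class MatchAllCount {
    static int count(BitSet liveDocs) {
        // Popcount over the underlying words; no per-doc iteration needed.
        return liveDocs.cardinality();
    }

    public static void main(String[] args) {
        int maxDoc = 1000;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // all docs live...
        live.clear(10);      // ...then delete two
        live.clear(500);
        System.out.println(count(live)); // prints 998
    }
}
```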
Your idea sounds good to me. Do you actually need a side-car iterator for
deletes, or could you use a nextClearBit() operation on the bit set?
I don't think we can fold it into Weight#count since