Hey Michael,

You are right: iterating over all deletes with nextClearBit() would run in
O(maxDoc). I am coming from the other direction, where I'm expecting the
number of deletes to be more on the order of 1%-5% of the doc ID space, so
a separate int[] would use lots of heap and probably not help that much
compared with nextClearBit(). My mental model is that the two most common
use cases are append-only workloads, where there are no deletes at all, and
update workloads, which commonly have several percent of deleted docs. It's
not clear to me how common it is to have very few deletes.
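
To put rough numbers on the heap cost (hypothetical segment size): in a
10M-doc segment with 5% deletes, a separate int[] of deleted docIDs would
add 500,000 * 4 bytes ~= 2 MB on top of the ~1.25 MB (10,000,000 / 8 bytes)
that the FixedBitSet already costs. Past maxDoc/32 deletes, i.e. roughly 3%
of the doc ID space, the int[] alone outweighs the bit set.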

On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote:

> Thanks Adrien!
>
> My thinking with a separate iterator was that nextClearBit() is relatively
> expensive (O(maxDoc) to traverse everything, I think). The solution I was
> imagining would involve an index-time change to output, say, an int[] of
> deleted docIDs if the number is sufficiently small (like maybe less than
> 1000). Then the livedocs interface could optionally return a cheap deleted
> docs iterator (i.e. only if the number of deleted docs is less than the
> threshold). Technically, the cost would be O(1), since we set a constant
> bound on the effort and fail otherwise. :)
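>
> A minimal sketch of the shape I have in mind (the interface name is made
> up; this is not an existing Lucene API):
>
>     // Hypothetical: live docs that can optionally expose their complement.
>     // Bits is org.apache.lucene.util.Bits, DocIdSetIterator is
>     // org.apache.lucene.search.DocIdSetIterator.
>     interface LiveDocsWithDeletes extends Bits {
>       /**
>        * Iterator over the deleted docIDs, or null when the number of
>        * deletes exceeds the cheap threshold.
>        */
>       DocIdSetIterator deletedDocsIterator();
>     }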
>
> I think 1000 doc value lookups would be cheap, but I don't know if the
> guarantee is cheap enough to make it into Weight#count.
>
> That said, I'm going to see if iterating with nextClearBit() is
> sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
> refresh could be an option too.
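>
> A naive refresh-time sketch of that precomputation (one O(maxDoc) pass per
> new reader; the helper name and hardcoded 1000 threshold are placeholders):
>
>     import org.apache.lucene.index.LeafReader;
>     import org.apache.lucene.util.Bits;
>
>     // Returns the deleted docIDs if there are few enough, else null.
>     static int[] smallDeletedDocs(LeafReader reader) {
>       int numDeleted = reader.numDeletedDocs();
>       if (numDeleted == 0 || numDeleted > 1000) {
>         return null; // no deletes, or too many to stay within the bound
>       }
>       Bits liveDocs = reader.getLiveDocs(); // non-null since numDeleted > 0
>       int[] deleted = new int[numDeleted];
>       int out = 0;
>       for (int doc = 0; doc < reader.maxDoc(); doc++) {
>         if (liveDocs.get(doc) == false) {
>           deleted[out++] = doc;
>         }
>       }
>       return deleted;
>     }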
>
> Thanks again,
> Froh
>
> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Indeed, only MatchAllDocsQuery knows how to produce a count when there
>> are deletes.
>>
>> Your idea sounds good to me. Do you actually need a sidecar iterator for
>> deletes, or could you use a nextClearBit() operation on the bit set?
>>
>> I don't think we can fold it into Weight#count, since there is an
>> expectation that its cost is negligible compared with the cost of a naive
>> count, but we may be able to do it in IndexSearcher#count or on the
>> OpenSearch side.
>>
>> On Fri, Feb 2, 2024 at 11:50 PM Michael Froh <msf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On OpenSearch, we've been taking advantage of the various O(1)
>>> Weight#count() implementations to quickly compute aggregations
>>> without needing to iterate over all the matching documents (at least when
>>> the top-level query is functionally a match-all at the segment level). Of
>>> course, from what I've seen, every clever Weight#count()
>>> implementation falls apart (returns -1) in the face of deletes.
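>>>
>>> Roughly, the fast path looks like this (simplified; searcher and query are
>>> assumed to be in scope, and the fallback is elided):
>>>
>>>     Weight weight = searcher.createWeight(
>>>         searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1f);
>>>     for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
>>>       int count = weight.count(ctx); // -1 means "unknown", e.g. with deletes
>>>       if (count == -1) {
>>>         // fall back to iterating this segment's matches
>>>       }
>>>     }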
>>>
>>> I was thinking that we could still handle small numbers of deletes
>>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>>
>>> Suppose, for example, that you're doing a date histogram aggregation: you
>>> could get the counts for each bucket from the points tree (ignoring
>>> deletes), then iterate through the deleted docs and decrement their
>>> contribution from the relevant bucket (determined by a docvalues lookup).
>>> Assuming the number of deleted docs is small, that should be cheap, right?
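>>>
>>> Something like this, say (the "timestamp" field, bucketFor(), counts, and
>>> the precomputed deletedDocs array are all made up for illustration):
>>>
>>>     // Undo each deleted doc's contribution to its histogram bucket.
>>>     NumericDocValues dv = DocValues.getNumeric(leafReader, "timestamp");
>>>     for (int doc : deletedDocs) { // small, ascending int[] of deleted docIDs
>>>       if (dv.advanceExact(doc)) {
>>>         counts[bucketFor(dv.longValue())]--;
>>>       }
>>>     }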
>>>
>>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
>>> not great for iteration. I'm imagining adding a supplementary "deleted docs
>>> iterator" that could sit next to the FixedBitSet if and only if the number
>>> of deletes is "small". Is there a better way that I should be thinking
>>> about this?
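>>>
>>> For what it's worth, wrapping a small sorted int[] as a DocIdSetIterator
>>> is tiny; a sketch, not an existing Lucene class:
>>>
>>>     import java.io.IOException;
>>>     import org.apache.lucene.search.DocIdSetIterator;
>>>
>>>     final class DeletedDocsIterator extends DocIdSetIterator {
>>>       private final int[] docs; // deleted docIDs in ascending order
>>>       private int i = -1;
>>>
>>>       DeletedDocsIterator(int[] docs) { this.docs = docs; }
>>>
>>>       @Override public int docID() {
>>>         if (i < 0) return -1;
>>>         return i < docs.length ? docs[i] : NO_MORE_DOCS;
>>>       }
>>>       @Override public int nextDoc() {
>>>         return ++i < docs.length ? docs[i] : NO_MORE_DOCS;
>>>       }
>>>       @Override public int advance(int target) throws IOException {
>>>         return slowAdvance(target); // linear scan is fine at this size
>>>       }
>>>       @Override public long cost() { return docs.length; }
>>>     }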
>>>
>>> Thanks,
>>> Froh
>>>
>>

-- 
Adrien
