Re: Computing weight.count() cheaply in the face of deletes?

Michael Froh Mon, 05 Feb 2024 22:03:17 -0800

Thanks Adrien!

My thinking with a separate iterator was that nextClearBit() is relatively
expensive (O(maxDoc) to traverse everything, I think). The solution I was
imagining would involve an index-time change to output, say, an int[] of
deleted docIDs if the number is sufficiently small (like maybe less than
1000). Then the livedocs interface could optionally return a cheap deleted
docs iterator (i.e. only if the number of deleted docs is less than the
threshold). Technically, the cost would be O(1), since we set a constant
bound on the effort and fail otherwise. :)


I think 1000 doc value lookups would be cheap, but I don't know if the
guarantee is cheap enough to make it into Weight#count.

That said, I'm going to see if iterating with nextClearBit() is
sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
refresh could be an option too.

Thanks again,
Froh

On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com> wrote:

> Hi Michael,
>
> Indeed, only MatchAllDocsQuery knows how to produce a count when there are
> deletes.
>
> Your idea sounds good to me, do you actually need a side car iterator for
> deletes, or could you use a nextClearBit() operation on the bit set?
>
> I don't think we can fold it into Weight#count since there is an
> expectation that it is negligible compared with the cost of a naive count,
> but we may be able to do it in IndexSearcher#count or on the OpenSearch
> side.
>
> Le ven. 2 févr. 2024, 23:50, Michael Froh <msf...@gmail.com> a écrit :
>
>> Hi,
>>
>> On OpenSearch, we've been taking advantage of the various O(1)
>> Weight#count() implementations to quickly compute various aggregations
>> without needing to iterate over all the matching documents (at least when
>> the top-level query is functionally a match-all at the segment level). Of
>> course, from what I've seen, every clever Weight#count()
>> implementation falls apart (returns -1) in the face of deletes.
>>
>> I was thinking that we could still handle small numbers of deletes
>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>
>> Like suppose you're doing a date histogram aggregation, you could get the
>> counts for each bucket from the points tree (ignoring deletes), then
>> iterate through the deleted docs and decrement their contribution from the
>> relevant bucket (determined based on a docvalues lookup). Assuming the
>> number of deleted docs is small, it should be cheap, right?
>>
>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
>> not great for iteration. I'm imagining adding a supplementary "deleted docs
>> iterator" that could sit next to the FixedBitSet if and only if the number
>> of deletes is "small". Is there a better way that I should be thinking
>> about this?
>>
>> Thanks,
>> Froh
>>
>

Re: Computing weight.count() cheaply in the face of deletes?

Reply via email to