Thanks Adrien! My thinking with a separate iterator was that nextClearBit() is relatively expensive (O(maxDoc) to traverse everything, I think). The solution I was imagining would involve an index-time change to output, say, an int[] of deleted docIDs if the number is sufficiently small (like maybe less than 1000). Then the livedocs interface could optionally return a cheap deleted docs iterator (i.e. only if the number of deleted docs is less than the threshold). Technically, the cost would be O(1), since we set a constant bound on the effort and fail otherwise. :)
I think 1000 doc value lookups would be cheap, but I don't know if the guarantee is cheap enough to make it into Weight#count. That said, I'm going to see if iterating with nextClearBit() is sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on refresh could be an option too. Thanks again, Froh On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com> wrote: > Hi Michael, > > Indeed, only MatchAllDocsQuery knows how to produce a count when there are > deletes. > > Your idea sounds good to me, do you actually need a side car iterator for > deletes, or could you use a nextClearBit() operation on the bit set? > > I don't think we can fold it into Weight#count since there is an > expectation that it is negligible compared with the cost of a naive count, > but we may be able to do it in IndexSearcher#count or on the OpenSearch > side. > > Le ven. 2 févr. 2024, 23:50, Michael Froh <msf...@gmail.com> a écrit : > >> Hi, >> >> On OpenSearch, we've been taking advantage of the various O(1) >> Weight#count() implementations to quickly compute various aggregations >> without needing to iterate over all the matching documents (at least when >> the top-level query is functionally a match-all at the segment level). Of >> course, from what I've seen, every clever Weight#count() >> implementation falls apart (returns -1) in the face of deletes. >> >> I was thinking that we could still handle small numbers of deletes >> efficiently if only we could get a DocIdSetIterator for deleted docs. >> >> Like suppose you're doing a date histogram aggregation, you could get the >> counts for each bucket from the points tree (ignoring deletes), then >> iterate through the deleted docs and decrement their contribution from the >> relevant bucket (determined based on a docvalues lookup). Assuming the >> number of deleted docs is small, it should be cheap, right? >> >> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's >> not great for iteration. I'm imagining adding a supplementary "deleted docs >> iterator" that could sit next to the FixedBitSet if and only if the number >> of deletes is "small". Is there a better way that I should be thinking >> about this? >> >> Thanks, >> Froh >> >