gsmiller commented on pull request #127: URL: https://github.com/apache/lucene/pull/127#issuecomment-834034324
@rmuir interesting thought. There might be something to optimize here with the knowledge that each value (within the context of a doc) is sorted, but I don't think it will fully resolve the need to rollup within the context of a doc (e.g., calling startDoc() and endDoc()) or the need for bit sets completely. It's also possible I'm not fully understanding your suggestion though or overlooking something. So for starters, the ranges requested for counting can overlap, but the elementary ranges created when setting up the segment tree cannot. For single-value docs, the algorithm first increments elementary interval counts for each doc. Of course, in this step, there's no risk of double-counting since each doc only has one value and the elementary intervals don't overlap. Then, at the end, the "rollup" step increments the requested ranges by rolling up the elementary intervals into the requested ranges. Note that each elementary interval can contribute to multiple requested ranges, and more than one elementary interval con contribute to a single requested range. But, even though multiple elementary intervals can roll up to the same requested range, it's accurate to sum these counts for the requested range since we know each elementary interval count came from a single doc (since the docs are single-valued). For multi-value cases, there are two places where double-counting can happen: First, when accumulating into the elementary intervals, it's possible a doc will have multiple values that land in the same interval. This is the more obvious case, and the one I'm solving with `multiValuedDocLeafHits`. This is also where I think your suggestion could be used, not bothering to even set an elementary range if we know we've already set it (by taking advantage of the sorted property of the doc values). The trickier double-counting case is when rolling up the elementary intervals into the requested ranges. Because multiple elementary intervals can roll into one requested range, if the same document contributed to multiple elementary intervals for one range, we'd double-count the requested range. So if we waiting until the end to do the rollup, if multiple elementary intervals rolled up to one requested range, we don't know if we can actually sum the counts since it's possible they came from the same doc. This is the root problem I'm trying to address by doing the rollup after each doc, and also why I need `multiValuedDocRangeHits` (so multiple elementary intervals don't double-count the same requested range for the same doc). Apologies if I'm telling you a bunch of stuff you already know about this code or completely missing your point. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org