gsmiller commented on pull request #127:
URL: https://github.com/apache/lucene/pull/127#issuecomment-834034324


   @rmuir interesting thought. There might be something to optimize here with 
the knowledge that each value (within the context of a doc) is sorted, but I 
don't think it will fully resolve the need to rollup within the context of a 
doc (e.g., calling startDoc() and endDoc()) or the need for bit sets 
completely. It's also possible I'm not fully understanding your suggestion 
though or overlooking something.
   
   So for starters, the ranges requested for counting can overlap, but the 
elementary ranges created when setting up the segment tree cannot. For 
single-value docs, the algorithm first increments elementary interval counts 
for each doc. Of course, in this step, there's no risk of double-counting since 
each doc only has one value and the elementary intervals don't overlap. Then, 
at the end, the "rollup" step increments the requested ranges by rolling up the 
elementary intervals into the requested ranges. Note that each elementary 
interval can contribute to multiple requested ranges, and more than one 
elementary interval con contribute to a single requested range. But, even 
though multiple elementary intervals can roll up to the same requested range, 
it's accurate to sum these counts for the requested range since we know each 
elementary interval count came from a single doc (since the docs are 
single-valued).
   
   For multi-value cases, there are two places where double-counting can 
happen: First, when accumulating into the elementary intervals, it's possible a 
doc will have multiple values that land in the same interval. This is the more 
obvious case, and the one I'm solving with `multiValuedDocLeafHits`. This is 
also where I think your suggestion could be used, not bothering to even set an 
elementary range if we know we've already set it (by taking advantage of the 
sorted property of the doc values). The trickier double-counting case is when 
rolling up the elementary intervals into the requested ranges. Because multiple 
elementary intervals can roll into one requested range, if the same document 
contributed to multiple elementary intervals for one range, we'd double-count 
the requested range. So if we waiting until the end to do the rollup, if 
multiple elementary intervals rolled up to one requested range, we don't know 
if we can actually sum the counts since it's possible they came from 
 the same doc. This is the root problem I'm trying to address by doing the 
rollup after each doc, and also why I need `multiValuedDocRangeHits` (so 
multiple elementary intervals don't double-count the same requested range for 
the same doc).
   
   Apologies if I'm telling you a bunch of stuff you already know about this 
code or completely missing your point.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to