msokolov commented on pull request #235:
URL: https://github.com/apache/lucene/pull/235#issuecomment-896098498


   Thanks for all the comments; I'll follow up with a new commit that addresses 
them soon. `1 / (1 + x)` makes a lot of sense; I was groping towards it :)
   
   Re: the random-distribution assumption for segments -- I believe this 
depends very much on the use case. Our experience in e-commerce is that it is 
*usually* true. We've seen occasional outlying cases (more popular media 
products get re-indexed more often, and there can be correlation if 
*popularity* is an important query feature, which it is), but this is more the 
exception than the rule. OTOH a time-series index is likely to be heavily 
correlated, so a different strategy is appropriate (also, sequential operation 
can more easily re-use thresholds across segments, and if the segments can be 
sorted, that will help). Perhaps the vanilla approach (collect K per segment) 
is best as a safe first step, but I think some optimization here will have a 
large impact, since `K` directly influences the number of nodes explored in 
the graph, and hence the query cost. Maybe it will deserve some kind of 
parameterization. So yes, I agree: let's remove this for now and follow up 
later.
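
   To make the vanilla approach concrete, here is a minimal sketch in Java: 
every segment contributes up to `K` hits, and the per-segment lists are merged 
into a global top `K`. The `PerSegmentTopK` class, the `Hit` type, and the 
`mergeTopK` method are hypothetical names for illustration only; this is not 
Lucene's API and not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class PerSegmentTopK {
    /** Hypothetical hit type: a document id and a score (higher is better). */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Merges per-segment result lists into the global top k, best first.
     * Each inner list is assumed to hold at most k hits from one segment.
     */
    static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble(h -> h.score);
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> segmentHits : perSegment) {
            for (Hit hit : segmentHits) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll(); // evict the current weakest hit
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> segmentA = List.of(new Hit(1, 0.9f), new Hit(2, 0.5f));
        List<Hit> segmentB = List.of(new Hit(3, 0.8f), new Hit(4, 0.1f));
        for (Hit hit : mergeTopK(List.of(segmentA, segmentB), 2)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

   With `S` segments this merge examines up to `S * K` candidates to keep `K`, 
which is why shrinking the per-segment `K` is attractive when hits really are 
spread evenly across segments.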


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]