[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-12 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530794180
 
 
   > I was thinking of a class similar to HitsThresholdChecker, allowing shared 
state, since, as you rightly said, the size of critical section should be 
small. WDYT?
   
   This is a good idea, it could be used to publish the maximum minimum value 
and collectors would use it to check with their local minimum too.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530515005
 
 
   > RE: synchronization, what are your thoughts on using a global shared array 
where each collector publishes its bottom value vs message passing?
   
   Sure that's a good way to keep track of the maximum minimum score without 
synchronization. 
   However I think only benchmark can reveal the cost of this so my main 
concern is the change in the API that is required to share the minimum score 
globally. If we can come up with an approach that keeps the current top score 
docs collector simple and contained but isn't optimized in terms of 
synchronization I'd favor simplicity unless we have clear evidence in 
benchmarks that we can do better ;).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530504547
 
 
   I think we're talking of different approaches, hence the confusion. It is 
correct that we can start setting the minimum score when the global count of 
document that we collected reaches the requested size but if the local pqs are 
not full you can only use the minimum minimum score. 
   So the bottom score of the minimum scores. 
   Requiring a queue to be filled completely before publishing a minimum score 
allows to use the maximum minimum score among the slices that have a full pq. 
We can mix the two approaches, switching from the minimum minimum to the 
maximum minimum when pqs are filled but I wonder if this is really needed since 
topN is a small value ? Said differently I wonder if checking the global 
minimum score before a single pq is filled is a premature optimization ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530443453
 
 
   > Which contract would that be? I believe that today, the contract that we 
call updateMinCompetitiveScore only when the PQ is full is implied (since there 
is only one PQ) but I am still not able to understand why that is mandatory. Is 
there something that I missing?
   
   We need to ensure that the minimum score of the slice can be set as the 
maximum minimum score for all slices, however if the slice is not full this 
means that some document with smaller scores can still enter the top N no 
matter what the top hits are in the other slices. So if the total hit threshold 
is 10 for instance and you ask for 2 top documents per slice, you cannot set 
the minimum score if only 1 document is inserted in each slice otherwise you 
can miss documents.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530398602
 
 
   > Thanks for offering. I have a skeletal PR in flight for this approach that 
I plan to publish tomorrow -- maybe we can iterate on that?
   
   Sure, thanks
   
   > I am sure I am missing something here. If the user requested top N hits, 
then all slices can keep collecting hits in their thread local PQs and update a 
global counter to reflect if total hits collected globally has reached N. Once 
we have reached N globally, each collector can publish the value of the bottom 
of their thread local PQ. The minimum of all such values will be our global 
minimum score, since we know that, collectively, we have N hits available. Post 
that, all collectors will use the global minimum score to filter hits. If, a 
collector finds a competitive hit, it adds it to the local queue, updates its 
local minimum score and triggers a resync, where the minimum of all minimum 
scores (if that makes sense) is taken and kept as the global worst hit.
   
   I think it would be simpler to keep the maximum minimum score on each slice. 
Each time a slice publish a new minimum score we can broadcast a listener to 
all the other top docs collector that would update their local minimum score if 
needed. Synchronization shouldn't be the bottleneck here but happy to be proven 
wrong. 
   The global counter of total hits must be reached to publish any minimum 
score but the publisher must also ensure that his local pq is full before 
publishing since it is possible to reach the total hits threshold while none of 
the local pq are completely filled so this would break the contract.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530373796
 
 
   I am not sure that synchronization would be an issue in this model since the 
update of the global min score should be quick, we can also use double checked 
locking or use a volatile. But it's hard to say without benchmarking as Mike 
said. It would be easier to discuss with a poc to compare with the current 
approach in this pr. I can work on one if you want to concentrate on the shared 
pq path ?
   
   > Once globally, numHits are collected, we do a pass through the array and 
pick the minimum? 
   
   If numHits means the global threshold then it's just one condition. The 
slice must also have a full pq to publish a min score and other slices can use 
it if they need to collect the same number of top docs. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530345711
 
 
   I think that you need to ensure that the slice publishing the new min score 
is full, otherwise the min score cannot be the global min score. If the other 
slices are not full they can use the new min score immediately but that's 
assuming that all slices have the same topN size.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-11 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-530342313
 
 
   > Is this proposal to continue collecting the full N per slice (thread work 
unit), until some/all of them have a full (top N) PQ and at that point you sync 
to find the "best bottom" across all of them? And every time one of the threads 
finds a newly competitive hit, increasing its bottom, we sync again to find the 
best bottom?
   
   Yes that's another option that we should investigate imo. 
   
   > This would still require sync-per-insert, and would require collecting top 
N per thread instead of top N overall? It's hard to judge which is the better 
tradeoff w/o building both and running perf tests :)
   
   If we keep separate pq we'd not need sync-per-insert and could use a 
volatile to record the shared min score within the top docs collectors. It 
seems appealing to me because the top N should be small so the overhead of 
collecting a full topN per slice is small compared to the converging of the min 
score which can be completely different per slice.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] [lucene-solr] jimczi commented on issue #854: Shared PQ Based Early Termination for Concurrent Search

2019-09-04 Thread GitBox
jimczi commented on issue #854: Shared PQ Based Early Termination for 
Concurrent Search
URL: https://github.com/apache/lucene-solr/pull/854#issuecomment-528035227
 
 
   I thought that the follow up for LUCENE-8939 would be to allow the sharing 
of the minimum score (could be extended to a minimum FieldDoc) across slices ? 
Sharing the minimum score (or minimum FieldDoc) requires very little 
synchronization while a global priority queue seems much more costly. The other 
advantage is that we could add this ability in the current topdocs collector 
like we did for LUCENE-8939.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org