[ 
https://issues.apache.org/jira/browse/SOLR-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18001391#comment-18001391
 ] 

Gus Heck commented on SOLR-17792:
---------------------------------

SOLR-17419 was committed days before SOLR-17158 was ready, and it made things
much more difficult (delaying SOLR-17158 by weeks). I worked through it,
beasted the related tests, and found and removed some deadlocks at that time.
I seem to recall that ensuring 'happens-before' was an issue behind one of the
hard-to-reproduce failures I fought, so don't forget to think about that side
effect of synchronization. If a deadlock is still possible, we should certainly
eliminate it. I tried to leave copious notes in comments, and there's some
discussion in the SOLR-17158 ticket that shouldn't be forgotten. The poll
interval is arbitrary, and I think I floated the idea of making it configurable
in side conversations, but that was met with the sentiment that such a thing
might be overdoing it (on the assumption that it was only encountered in a
rare case).
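
To make the concern about a fixed poll interval concrete, here is a minimal
sketch of how a timed poll can become a latency floor. It is a hypothetical
illustration, not the HttpShardHandler code: the class, the method, and the
'done' flag are all assumptions made up for the example. The waiter only
re-checks the flag after each poll returns, so a completion that is signalled
without enqueueing anything is noticed a full interval late.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration only; this is NOT Solr's HttpShardHandler code. */
public class PollLatencySketch {

  /**
   * Waits until a response is dequeued or the 'done' flag is observed.
   * The flag is only checked after each timed poll returns.
   * Returns the elapsed wall-clock time in milliseconds.
   */
  static long awaitMillis(LinkedBlockingQueue<String> responses,
                          AtomicBoolean done,
                          long pollMs) throws InterruptedException {
    long start = System.nanoTime();
    while (true) {
      String r = responses.poll(pollMs, TimeUnit.MILLISECONDS);
      if (r != null || done.get()) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Case 1: a response is already queued, so poll returns without waiting.
    LinkedBlockingQueue<String> queued = new LinkedBlockingQueue<>();
    queued.add("shard-response");
    System.out.println("queued:  "
        + awaitMillis(queued, new AtomicBoolean(false), 50) + " ms");

    // Case 2: completion is signalled only through the flag. Nothing wakes
    // the poll, so the waiter sits out the whole interval before noticing.
    LinkedBlockingQueue<String> empty = new LinkedBlockingQueue<>();
    System.out.println("flagged: "
        + awaitMillis(empty, new AtomicBoolean(true), 50) + " ms");
  }
}
```

If the real code has a path like the second case, the interval would show up
as a roughly constant penalty on the affected requests rather than as a rare
fallback. Whether it does is exactly what needs confirming.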

Therefore, I'm curious what proportion of requests fell into each of the
categories you described, and how the overall response time varied (if at all)
when you went back to HttpShardHandlerFactory. The additional question, of
course, is how much the queries themselves varied. Is this a replay of queries
gleaned from logs, or randomly selected terms in a simple, consistently shaped
query? Do you have any evidence of a deadlock from profiling or jstack?
If you share what you're finding, I'll try to help sort it out.
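
If capturing a jstack dump at the right moment is awkward, one alternative is
to ask the JVM from inside the process. The sketch below uses the standard
java.lang.management API; the class and method names (DeadlockCheck,
findDeadlocks) are made up for the example. Note that it only reports cycles on
monitors and ownable synchronizers, so threads that are merely parked in a
timed poll would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the names are illustrative, not part of Solr. */
public class DeadlockCheck {

  /** Returns one description per deadlocked thread, or an empty list. */
  static List<String> findDeadlocks() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // Returns null when no deadlock cycle is detected.
    long[] ids = mx.findDeadlockedThreads();
    List<String> out = new ArrayList<>();
    if (ids == null) {
      return out;
    }
    // Request locked monitors and synchronizers so the lock owners are shown.
    for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
      if (info != null) { // a thread may have exited in the meantime
        out.add(info.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> deadlocks = findDeadlocks();
    if (deadlocks.isEmpty()) {
      System.out.println("No deadlocked threads detected.");
    } else {
      deadlocks.forEach(System.out::println);
    }
  }
}
```

Something like this could be called from a test or a scheduled task while the
slow queries are being replayed, which makes an intermittent deadlock easier to
catch than a manually timed thread dump.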

> ParallelHttpShardHandler has massive performance issues.
> --------------------------------------------------------
>
>                 Key: SOLR-17792
>                 URL: https://issues.apache.org/jira/browse/SOLR-17792
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 9.8
>            Reporter: Houston Putman
>            Priority: Blocker
>             Fix For: 9.9
>
>
> SOLR-17158 changed the way that the HttpShardHandler (and 
> ParallelHttpShardHandler) did locking and concurrency. However, after 
> upgrading, and running distributed queries (at a relatively slow rate), I 
> noticed that there were 3 types of responses:
>  * QTimes between 3-6 ms
>  * QTimes between 53-56 ms
>  * And requests that timed out
> Looking at the logic in HttpShardHandler, there is a poll(50ms) call that is 
> very suspicious, and likely the reason for the jump between 3-6 ms and 53-56 
> ms. I would also assume that this change in concurrency logic is the reason 
> that many requests started timing out. Changing to the 
> HttpShardHandlerFactory from the ParallelHttpShardHandlerFactory fixed these 
> issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
