[ https://issues.apache.org/jira/browse/SOLR-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18001391#comment-18001391 ]
Gus Heck commented on SOLR-17792:
---------------------------------

SOLR-17419 was committed days before SOLR-17158 was ready, and it definitely made things much more difficult (delaying it by weeks). I worked through it, beasted the related tests, and definitely found and removed some deadlocks at that time. I seem to recall that ensuring 'happens-before' was the issue behind one of the hard-to-reproduce failures I fought, so don't forget to think about that side effect of synchronization. If a deadlock is still possible, we should certainly eliminate it. I tried to leave copious notes in comments, and there's some discussion in the 17158 ticket that shouldn't be forgotten, of course.

The poll timeout is arbitrary. I think I floated the idea of making it configurable in side conversations, but that was met with the sentiment that such a thing might be overdoing it (on the assumption that the timeout was only hit in a rare case). Therefore, I'm curious what proportion of responses fell into each of the categories you described, and how the overall response time varied (if at all) when you went back to HttpShardHandlerFactory.

Of course, the additional question is how much the queries themselves varied. Is this a replay of queries gleaned from logs, or randomly selected terms in a simple, consistently shaped query?

Do you have any evidence of a deadlock from profiling or jstack? If you share what you're finding, I'll try to help with sorting it out.

> ParallelHttpShardHandler has massive performance issues.
> --------------------------------------------------------
>
>                 Key: SOLR-17792
>                 URL: https://issues.apache.org/jira/browse/SOLR-17792
>             Project: Solr
>          Issue Type: Bug
>   Affects Versions: 9.8
>            Reporter: Houston Putman
>            Priority: Blocker
>             Fix For: 9.9
>
>
> SOLR-17158 changed the way that the HttpShardHandler (and ParallelHttpShardHandler) did locking and concurrency.
> However, after upgrading and running distributed queries (at a relatively slow rate), I noticed that there were 3 types of responses:
> * QTimes between 3-6 ms
> * QTimes between 53-56 ms
> * Requests that timed out
> Looking at the logic in HttpShardHandler, there is a poll(50ms) call that is very suspicious, and likely the reason for the jump between 3-6 ms and 53-56 ms. I would also assume that this change in concurrency logic is the reason that many requests started timing out. Changing to the HttpShardHandlerFactory from the ParallelHttpShardHandlerFactory fixed these issues.
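
For illustration only, here is a minimal Java sketch of how a fixed poll timeout could produce the two latency bands reported in the issue description. It is not Solr's HttpShardHandler code, and it is not a diagnosis: the class `PollLatencySketch`, the methods `waitForAll` and `run`, the `pending` counter, and the 5 ms simulated work are all hypothetical. The assumption being illustrated is that a waiter blocked in a timed `poll` only notices completion early if the completion arrives through the queue; if the pending count drops by some other path, the waiter sleeps until the timeout expires.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PollLatencySketch {

  // Hypothetical collection loop: block on the queue with a timeout and
  // re-check the pending counter after every poll. Returns elapsed millis.
  static long waitForAll(LinkedBlockingQueue<String> responses,
                         AtomicInteger pending, long pollMs)
      throws InterruptedException {
    long start = System.nanoTime();
    while (pending.get() > 0) {
      String rsp = responses.poll(pollMs, TimeUnit.MILLISECONDS);
      if (rsp != null) {
        pending.decrementAndGet(); // a response arrived and woke us early
      }
      // rsp == null: the timeout expired; loop and re-check pending
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  // Runs one simulated request that finishes after about 5 ms.
  // completeViaQueue = true:  the result is enqueued, so poll returns at once.
  // completeViaQueue = false: only the counter changes (assumed scenario),
  //                           so the waiter sits out the full poll timeout.
  static long run(boolean completeViaQueue, long pollMs)
      throws InterruptedException {
    LinkedBlockingQueue<String> responses = new LinkedBlockingQueue<>();
    AtomicInteger pending = new AtomicInteger(1);
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(5); // simulated shard work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // still complete below
      }
      if (completeViaQueue) {
        responses.add("ok");
      } else {
        pending.decrementAndGet();
      }
    });
    worker.start();
    long elapsed = waitForAll(responses, pending, pollMs);
    worker.join();
    return elapsed;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("completed via queue:       " + run(true, 50) + " ms");
    System.out.println("completed bypassing queue: " + run(false, 50) + " ms");
  }
}
```

In this sketch the first case finishes in roughly the simulated work time, while the second takes at least the poll timeout. Whether anything like the second path exists in the real handler is exactly what profiling or a jstack dump would need to show.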