[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732378#comment-16732378 ]

Mark Miller commented on SOLR-13100:
------------------------------------

I can spend a little more time here, but here's a quick reply.

HTTP/1.1 has a race condition around stale connections that cannot be 
overcome - though it can be defended against, and we have put a lot of effort 
there. Restarting a test Jetty instance is still a problem area. I think you 
can work around it by making explicit calls to the pool to check for expired 
connections at the right time; otherwise that check runs on a timer with a 
background thread, and somehow shutting Jetty down and starting it again does 
not play quite correctly with our stale-connection avoidance strategy (which 
tries to have the client manage the connection lifecycle instead of the 
server). We used to have more control over some of the inner workings around 
sweeping for idle connections and such, but now I think we rely more on the 
internal impls of a newer HTTP client version. I think the speed of an 
integration test is what makes this an issue; it's less likely to show up in 
the real world.
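
For reference, the two defenses mentioned above might look roughly like this 
with Apache HttpClient 4.x's pooling manager (a sketch only - the pool, sweep 
interval, idle timeout, and where you call evictStaleNow() are placeholder 
choices, not Solr's actual configuration):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class StaleConnectionSweeper {
    private final PoolingHttpClientConnectionManager pool =
            new PoolingHttpClientConnectionManager();
    private final ScheduledExecutorService sweeper =
            Executors.newSingleThreadScheduledExecutor();

    // The timer-with-a-thread approach: periodically evict expired and
    // long-idle connections. Between sweeps there is still a window in
    // which a stale connection can be handed out - that window is the race.
    public void startTimerSweep() {
        sweeper.scheduleAtFixedRate(() -> {
            pool.closeExpiredConnections();
            pool.closeIdleConnections(30, TimeUnit.SECONDS);
        }, 5, 5, TimeUnit.SECONDS);
    }

    // The explicit-call workaround: invoke this at "the right time", e.g.
    // from a test right after it restarts a Jetty instance, so the next
    // request cannot be handed a pooled connection to the old instance.
    public void evictStaleNow() {
        pool.closeExpiredConnections();
        pool.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
}
{code}

HttpClient 4.4+ also has pool.setValidateAfterInactivity(ms), which re-checks 
a connection that has sat idle before handing it out, but even that check can 
pass an instant before the server closes the socket - it narrows the race 
rather than closing it.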

HTTP/2 has no such race: you can simply check whether a connection is still 
valid, so I think this problem just goes away with 8.
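
A minimal sketch of such a validity check, assuming Jetty 9.4's HTTP/2 client 
APIs (host, port, and timings here are placeholders):

{code:java}
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.http2.api.Session;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.frames.PingFrame;
import org.eclipse.jetty.util.Callback;
import org.eclipse.jetty.util.FuturePromise;

public class Http2PingCheck {
    public static void main(String[] args) throws Exception {
        HTTP2Client client = new HTTP2Client();
        client.start();
        try {
            FuturePromise<Session> promise = new FuturePromise<>();
            // The listener sees the PING ack - positive proof the
            // connection is still alive.
            client.connect(new InetSocketAddress("localhost", 8984),
                    new Session.Listener.Adapter() {
                        @Override
                        public void onPing(Session session, PingFrame frame) {
                            if (frame.isReply()) {
                                System.out.println("PING ack - connection is valid");
                            }
                        }
                    }, promise);
            Session session = promise.get(5, TimeUnit.SECONDS);

            // The callback only confirms the PING frame was written; a
            // failure here already marks the connection as dead.
            session.ping(new PingFrame(System.nanoTime(), false),
                    new Callback() {
                        @Override
                        public void failed(Throwable x) {
                            System.out.println("connection is stale: " + x);
                        }
                    });

            Thread.sleep(1000); // crude wait for the ack in this sketch
        } finally {
            client.stop();
        }
    }
}
{code}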

> harden/manage connectionpool used for intra-cluster communication when we 
> know nodes go down
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13100
>                 URL: https://issues.apache.org/jira/browse/SOLR-13100
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Priority: Major
>
> I'm spinning this idea off of some comments I made in SOLR-13028...
> In that issue, in discussion of some test failures that can happen after a 
> node is shut down/restarted (new emphasis added)...
> {quote}
> The bit where the test fails is that it:
> # shuts down a jetty instance
> # starts the jetty instance again
> # does some waiting for all the collections to be "active" and all the 
> replicas to be "live"
> # tries to send an auto-scaling 'set-cluster-preferences' config change to 
> the cluster
> The bit of test code where it does this creates an entirely new 
> CloudSolrClient, ignoring the existing one except for the ZKServer address, 
> w/an explicit comment that the reason it's doing this is because the 
> connection pool on the existing CloudSolrClient might have a stale connection 
> to the old (i.e. dead) instance of the restarted Jetty...
>   ...
> ...doing this ensures that the cloudClient doesn't try to query the "dead" 
> server directly (on a stale connection) but IIUC this issue of stale 
> connections to the dead server instance is still problematic - and the root 
> cause of this failure - because after the CloudSolrClient picks a random node 
> to send the request to, _on the remote Solr side, that node then has to 
> dispatch a request to each and every node, and at that point the node doing 
> the distributed dispatch may also have a stale connection pool pointing at a 
> server instance that's no longer listening._
> {quote}
> *The point of this issue is to explore if/how we can -- in general -- better 
> deal with pooled connections in situations where the cluster state knows that 
> an existing node has gone down or been restarted.*
> SOLR-13028 is a particular example of when/how stale pooled connection info 
> can cause test problems -- and the bulk of the discussion in that issue is 
> about how that specific code path (dealing with an intra-cluster autoscaling 
> handler command dispatch) can be improved to do a retry in the event of 
> NoHttpResponseException -- but not every place where Solr nodes need to talk 
> to each other can blindly retry on every possible connection exception; and 
> even when we can, it would be better if we could minimize the risk of the 
> request failing in a way that requires a retry.
> *So why not improve our HTTP connection pool to be aware of our clusterstate 
> and purge connections when we know nodes have been shut down/lost?*
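>
> A rough sketch of the idea, assuming a hypothetical onNodesLost() hook 
> called by whatever component watches live_nodes; the pool calls are real 
> Apache HttpClient 4.x APIs, though 4.x only exposes blunt pool-wide eviction 
> of idle connections rather than per-route eviction - which is part of what 
> would need exploring:
> {code:java}
> import java.util.Set;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
>
> public class ClusterStateAwarePool {
>     private final PoolingHttpClientConnectionManager pool;
>
>     public ClusterStateAwarePool(PoolingHttpClientConnectionManager pool) {
>         this.pool = pool;
>     }
>
>     // Hypothetical hook: invoked by whatever watches live_nodes in ZK
>     // and notices that nodes have dropped out of the cluster.
>     public void onNodesLost(Set<String> lostNodes) {
>         if (lostNodes.isEmpty()) {
>             return;
>         }
>         // Pooled (not leased) connections to a dead node are idle by
>         // definition, so evicting everything idle clears the stale ones,
>         // at the cost of also dropping idle connections to healthy nodes.
>         // Requests already in flight to the dead node will still fail and
>         // need the retry handling discussed above.
>         pool.closeExpiredConnections();
>         pool.closeIdleConnections(0, TimeUnit.MILLISECONDS);
>     }
> }
> {code}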


