[ https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732635#comment-16732635 ]
Mark Miller commented on SOLR-13100:
------------------------------------

You might be able to - I mention that more of the impl is internal now because I think I worked around this when we did some of the impl, and how I did it then doesn't work now. You may be able to tie invalidating the right connections to a node leaving live nodes on 7x without much of a race, but I'm not sure it's as straightforward as it seems.

> harden/manage connection pool used for intra-cluster communication when we
> know nodes go down
> --------------------------------------------------------------------------------------------
>
>                  Key: SOLR-13100
>                  URL: https://issues.apache.org/jira/browse/SOLR-13100
>              Project: Solr
>           Issue Type: Improvement
>       Security Level: Public (Default Security Level. Issues are Public)
>             Reporter: Hoss Man
>             Priority: Major
>
> I'm spinning this idea off of some comments I made in SOLR-13028...
>
> In that issue, in discussion of some test failures that can happen after a
> node is shut down/restarted (new emphasis added)...
> {quote}
> The bit where the test fails is that it:
> # shuts down a jetty instance
> # starts the jetty instance again
> # does some waiting for all the collections to be "active" and all the
> replicas to be "live"
> # tries to send an auto-scaling 'set-cluster-preferences' config change to
> the cluster
>
> The bit of test code where it does this creates an entirely new
> CloudSolrClient, ignoring the existing one except for the ZKServer address,
> w/an explicit comment that the reason it's doing this is because the
> connection pool on the existing CloudSolrClient might have a stale connection
> to the old (i.e. dead) instance of the restarted jetty...
> ...
> ...doing this ensures that the cloudClient doesn't try to query the "dead"
> server directly (on a stale connection), but IIUC this issue of stale
> connections to the dead server instance is still problematic - and the root
> cause of this failure - because after the CloudSolrClient picks a random node
> to send the request to, _on the remote solr side, that node then has to
> dispatch a request to each and every node, and at that point the node doing
> the distributed dispatch may also have a stale connection pool pointing at a
> server instance that's no longer listening._
> {quote}
> *The point of this issue is to explore if/how we can -- in general -- better
> deal with pooled connections in situations where the cluster state knows that
> an existing node has gone down or been restarted.*
>
> SOLR-13028 is a particular example of when/how stale pooled connection info
> can cause test problems -- and the bulk of the discussion in that issue is
> about how that specific code path (in dealing with an intra-cluster
> autoscaling handler command dispatch) can be improved to do a retry in the
> event of a NoHttpResponseException -- but not every place where solr nodes
> need to talk to each other can blindly retry on every possible connection
> exception; and even when we can, it would be better if we could minimize the
> risk of the request failing in a way that would require a retry.
>
> *So why not improve our HTTP connection pool to be aware of our clusterstate
> and purge connections when we know nodes have been shut down/lost?*
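To make the workaround quoted above concrete, here is a minimal sketch of the "create a new client after a restart" pattern using the SolrJ 7.x builder (the zkHost and request parameters are stand-ins, not the actual test code):

{code:java}
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class FreshClientWorkaround {

  // zkHost and request are stand-ins for whatever the test already has.
  static void sendViaFreshClient(String zkHost, SolrRequest<?> request) throws Exception {
    // A brand-new client means a brand-new connection pool, so there are
    // no stale sockets pointing at the pre-restart jetty instance.
    try (CloudSolrClient freshClient =
             new CloudSolrClient.Builder(Collections.singletonList(zkHost), Optional.empty())
                 .build()) {
      freshClient.request(request);
    }
    // Note: this only fixes the client side; the node that receives the
    // request still uses its own (possibly stale) pool for distributed dispatch.
  }
}
{code}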
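The SOLR-13028-style mitigation -- retrying on NoHttpResponseException -- can be wired into an Apache HttpClient 4.x client roughly like this (a sketch, not the actual patch; the retry limit of 2 is arbitrary):

{code:java}
import java.io.IOException;

import org.apache.http.NoHttpResponseException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

public class RetryOnStaleConnection {

  public static CloseableHttpClient buildClient() {
    // Retry only when the server dropped the connection without sending a
    // response -- the usual symptom of writing to a pooled connection whose
    // node was restarted underneath us -- and only a couple of times.
    HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
      @Override
      public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        return exception instanceof NoHttpResponseException && executionCount <= 2;
      }
    };
    return HttpClients.custom().setRetryHandler(retryHandler).build();
  }
}
{code}

As the description notes, retrying like this is only safe where the request is idempotent or the handler tolerates a duplicate, which is why it can't be applied blindly to every code path.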
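And the clusterstate-aware purge the closing question proposes might start out like the sketch below. The LiveNodesCallback interface is invented for illustration (in Solr it would be driven by whatever watches /live_nodes in ZK), and since HttpClient 4.x's PoolingHttpClientConnectionManager exposes no per-host eviction, this flushes all idle connections whenever any node drops out -- connections checked out for in-flight requests are untouched, which is where the race Mark mentions would still live.

{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class LiveNodesPoolPurger {

  /** Invented for this sketch; stands in for a real live-nodes watcher. */
  public interface LiveNodesCallback {
    void onChange(Set<String> oldLiveNodes, Set<String> newLiveNodes);
  }

  private final PoolingHttpClientConnectionManager pool;

  public LiveNodesPoolPurger(PoolingHttpClientConnectionManager pool) {
    this.pool = pool;
  }

  public LiveNodesCallback asCallback() {
    return (oldLiveNodes, newLiveNodes) -> {
      Set<String> lost = new HashSet<>(oldLiveNodes);
      lost.removeAll(newLiveNodes);
      if (!lost.isEmpty()) {
        // No per-host eviction in HttpClient 4.x's pooling manager, so drop
        // every idle pooled connection; anything checked out for an in-flight
        // request keeps going and can still hit the dead node.
        pool.closeIdleConnections(0, TimeUnit.MILLISECONDS);
      }
    };
  }
}
{code}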