Hoss Man created SOLR-13100:
-------------------------------

             Summary: harden/manage connection pool used for intra-cluster 
communication when we know nodes go down
                 Key: SOLR-13100
                 URL: https://issues.apache.org/jira/browse/SOLR-13100
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man


I'm spinning this idea off of some comments I made in SOLR-13028...

In that issue, in discussion of some test failures that can happen after a node 
is shut down/restarted (new emphasis added)...

{quote}
The bit where the test fails is that it:

# shuts down a jetty instance
# starts the jetty instance again
# does some waiting for all the collections to be "active" and all the replicas 
to be "live"
# tries to send an auto-scaling 'set-cluster-preferences' config change to the 
cluster

The bit of test code where it does this creates an entirely new 
CloudSolrClient, ignoring the existing one except for the ZKServer address, 
w/an explicit comment that the reason it's doing this is because the connection 
pool on the existing CloudSolrClient might have a stale connection to the old 
(i.e. dead) instance of the restarted jetty...
  ...
...doing this ensures that the cloudClient doesn't try to query the "dead" 
server directly (on a stale connection) but IIUC this issue of stale 
connections to the dead server instance is still problematic - and the root 
cause of this failure - because after the CloudSolrClient picks a random node 
to send the request to, _on the remote solr side, that node then has to 
dispatch a request to each and every node, and at that point the node doing the 
distributed dispatch may also have a stale connection pool pointing at a server 
instance that's no longer listening._
{quote}
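
For reference, that workaround boils down to something like the minimal sketch 
below (not the exact test code; the zkHost value is a placeholder): a brand new 
CloudSolrClient is built from only the ZooKeeper address, so its (empty) 
connection pool can't be holding a stale socket to the dead jetty instance.

{code:java}
// A minimal sketch of the SOLR-13028 test workaround (names are placeholders,
// not the exact test code): discard the existing client and build a fresh
// CloudSolrClient from only the ZooKeeper address, so its connection pool
// starts empty and can't point at the old, dead jetty instance.
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class FreshClientWorkaroundSketch {
  public static CloudSolrClient freshClient(String zkHost) {
    // zkHost is the only thing reused from the original client, e.g. "localhost:2181"
    return new CloudSolrClient.Builder(Collections.singletonList(zkHost), Optional.empty())
        .build();
  }
}
{code}

But as the quote notes, that only protects the client-side hop; the node that 
receives the request still fans it out over its own, possibly stale, pool.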

*The point of this issue is to explore if/how we can -- in general -- better 
deal with pooled connections in situations where the cluster state knows that 
an existing node has gone down or been restarted.*

SOLR-13028 is a particular example of when/how stale pooled connection info can 
cause test problems -- and the bulk of the discussion in that issue is about 
how that specific code path (dealing with an intra-cluster autoscaling handler 
command dispatch) can be improved to do a retry in the event of 
NoHttpResponseException -- but not every place where Solr nodes need to talk to 
each other can blindly retry on every possible connection exception; and even 
when we can, it would be better if we could minimize the risk of the request 
failing in a way that requires a retry.
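
For idempotent requests that mitigation looks roughly like the hedged sketch 
below (the helper name and single-retry policy are illustrative, not Solr's 
actual dispatch code); it papers over the stale socket rather than removing it.

{code:java}
// A hedged sketch (not Solr's actual dispatch code) of the "retry once on
// NoHttpResponseException" mitigation discussed in SOLR-13028.  The helper
// name and retry count are illustrative; this is only safe for idempotent calls.
import org.apache.http.NoHttpResponseException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.util.NamedList;

public class RetryOnStaleConnectionSketch {
  public static NamedList<Object> requestWithOneRetry(SolrClient client, SolrRequest req)
      throws Exception {
    try {
      return client.request(req);
    } catch (SolrServerException e) {
      // A stale pooled socket to a restarted node typically surfaces as a
      // NoHttpResponseException wrapped in a SolrServerException.
      if (e.getRootCause() instanceof NoHttpResponseException) {
        return client.request(req);  // retry once, hopefully on a fresh connection
      }
      throw e;
    }
  }
}
{code}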

*So why not improve our HTTP connection pool to be aware of our clusterstate 
and purge connections when we know nodes have been shut down/lost?*
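
A very rough sketch of what that could look like, assuming some hook on 
live-node changes (the listener wiring and class name here are hypothetical, 
and since HttpClient 4.x's PoolingHttpClientConnectionManager has no public 
per-route eviction, the sketch simply drops all idle pooled connections):

{code:java}
// A hypothetical sketch of the proposal: when the cluster state tells us a
// node has gone away, aggressively evict pooled connections so the next
// intra-cluster request can't pick up a stale socket.  The listener wiring is
// assumed, not an existing Solr API, and lacking per-route eviction in
// HttpClient 4.x this drops every idle connection in the pool.
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class ClusterAwarePoolSketch {
  private final PoolingHttpClientConnectionManager pool;

  public ClusterAwarePoolSketch(PoolingHttpClientConnectionManager pool) {
    this.pool = pool;
  }

  /** Hypothetical callback, e.g. wired to live_nodes updates from ZK. */
  public void onLiveNodesChanged(Set<String> oldLiveNodes, Set<String> newLiveNodes) {
    if (!newLiveNodes.containsAll(oldLiveNodes)) {
      // At least one node disappeared: purge idle pooled connections so we
      // don't hand out a socket pointing at a server that's no longer listening.
      pool.closeExpiredConnections();
      pool.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
  }
}
{code}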


