[ 
https://issues.apache.org/jira/browse/SOLR-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732635#comment-16732635
 ] 

Mark Miller commented on SOLR-13100:
------------------------------------

You might be able to. I mention that more of the impl is internal now because 
I think I worked around this when we did some of that impl, and the way I did 
it then doesn't work now.

For 7x, you may be able to tie invalidating the right connections to a node 
leaving live nodes without much of a race, but I'm not sure it's as 
straightforward as it seems.
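
To make that concrete, here is a minimal sketch of the idea, assuming the 
intra-cluster pool is backed by Apache HttpClient 4.x's 
PoolingHttpClientConnectionManager; the listener wiring and the 
StaleConnectionPurger/onLiveNodesChanged names are hypothetical, not what is 
in the code today:

{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Hypothetical helper: purge pooled connections whenever the live_nodes set shrinks.
public class StaleConnectionPurger {

  private final PoolingHttpClientConnectionManager connectionManager;
  private volatile Set<String> lastLiveNodes;

  public StaleConnectionPurger(PoolingHttpClientConnectionManager connectionManager,
                               Set<String> initialLiveNodes) {
    this.connectionManager = connectionManager;
    this.lastLiveNodes = initialLiveNodes;
  }

  /**
   * Called whenever we observe a new live_nodes set (e.g. from a ZooKeeper watch).
   * If any node dropped out, evict idle pooled connections so we don't reuse a
   * socket pointing at a dead instance.
   */
  public void onLiveNodesChanged(Set<String> newLiveNodes) {
    boolean nodeLost = !newLiveNodes.containsAll(lastLiveNodes);
    lastLiveNodes = newLiveNodes;
    if (nodeLost) {
      // Coarse-grained: HttpClient 4.x has no public per-route purge, so this
      // drops every idle connection; leased (in-flight) connections are untouched.
      connectionManager.closeIdleConnections(0, TimeUnit.MILLISECONDS);
    }
  }
}
{code}

The coarse part is exactly the catch: without a per-route purge you throw away 
connections to healthy nodes too, and connections already leased for in-flight 
requests can still hit the dead instance, which is where the race comes in.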

 

 

> harden/manage connectionpool used for intra-cluster communication when we 
> know nodes go down
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13100
>                 URL: https://issues.apache.org/jira/browse/SOLR-13100
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>
> I'm spinning this idea off of some comments I made in SOLR-13028...
> In that issue, in discussion of some test failures that can happen after a 
> node is shut down/restarted (new emphasis added)...
> {quote}
> The bit where the test fails is that it:
> # shuts down a jetty instance
> # starts the jetty instance again
> # does some waiting for all the collections to be "active" and all the 
> replicas to be "live"
> tries to send an auto-scaling 'set-cluster-preferences' config change to 
> the cluster
> The bit of test code where it does this creates an entirely new 
> CloudSolrClient, ignoring the existing one except for the ZKServer address, 
> w/an explicit comment that the reason it's doing this is because the 
> connection pool on the existing CloudSolrClient might have a stale connection 
> to the old (i.e. dead) instance of the restarted jetty...
>   ...
> ...doing this ensures that the cloudClient doesn't try to query the "dead" 
> server directly (on a stale connection) but IIUC this issue of stale 
> connections to the dead server instance is still problematic - and the root 
> cause of this failure - because after the CloudSolrClient picks a random node 
> to send the request to, _on the remote solr side, that node then has to 
> dispatch a request to each and every node, and at that point the node doing 
> the distributed dispatch may also have a stale connection pool pointing at a 
> server instance that's no longer listening._
> {quote}
> *The point of this issue is to explore if/how we can -- in general -- better 
> deal with pooled connections in situations where the cluster state knows that 
> an existing node has gone down, or been restarted.*
> SOLR-13028 is a particular example of when/how stale pooled connection info 
> can cause test problems -- and the bulk of the discussion in that issue is 
> about how that specific code path (in dealing with an intra-cluster autoscaling 
> handler command dispatch) can be improved to do a retry in the event of 
> NoHttpResponseException -- but not every place where Solr nodes need to talk 
> to each other can blindly retry on every possible connection exception; and 
> even when we can, it would be better if we could minimize the risk of the 
> request failing in a way that would require a retry.
> *So why not improve our HTTP connection pool to be aware of our clusterstate 
> and purge connections when we know nodes have been shut down or lost?*
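
For comparison, the SOLR-13028-style retry mentioned above amounts to something 
like the sketch below; the Request interface and executeWithOneRetry() are 
hypothetical stand-ins for whatever actually dispatches the intra-cluster 
request, and only NoHttpResponseException comes from Apache HttpClient:

{code:java}
import org.apache.http.NoHttpResponseException;

// Hypothetical dispatcher: retry exactly once when a pooled connection turns
// out to be stale (the server closed it and we got no HTTP response back).
public class SingleRetryDispatcher {

  interface Request<T> {
    T execute() throws Exception;
  }

  /**
   * Only safe for idempotent requests -- as noted above, not every place where
   * Solr nodes talk to each other can blindly retry on a connection exception.
   */
  public <T> T executeWithOneRetry(Request<T> request) throws Exception {
    try {
      return request.execute();
    } catch (NoHttpResponseException e) {
      // The stale socket has been discarded; the retry should get a fresh connection.
      return request.execute();
    }
  }
}
{code}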



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
