Cao Manh Dat created SOLR-14356:
-----------------------------------

             Summary: PeerSync with hanging nodes
                 Key: SOLR-14356
                 URL: https://issues.apache.org/jira/browse/SOLR-14356
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Cao Manh Dat


Right now in {{PeerSync}} (during leader election), if an exception occurs while 
requesting versions from a node, we skip that node if the exception is one of the 
following types:
* ConnectTimeoutException
* NoHttpResponseException
* SocketException
Sometimes the other node hangs but still accepts connections. In that 
case a {{SocketTimeoutException}} is thrown, we consider the {{PeerSync}} process 
failed, and the whole shard stays leaderless forever (as long as the 
hung node is still there).
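
To illustrate the behaviour described above, here is a minimal sketch, not the actual {{PeerSync}} code. The class {{PeerFailureClassifier}} and its {{Outcome}} enum are hypothetical, and exceptions are matched by simple class name only so that the example compiles without the HttpClient dependency. It shows why a hung node falls through: in the JDK, {{SocketTimeoutException}} extends {{InterruptedIOException}}, not {{SocketException}}.

```java
import java.util.Set;

public class PeerFailureClassifier {
    /** Hypothetical outcomes, named for this sketch only. */
    public enum Outcome { SKIP_NODE, SYNC_FAILED }

    // The three exception types listed in the issue, matched by simple name
    // (an assumption made to keep the example dependency-free).
    private static final Set<String> SKIPPABLE = Set.of(
            "ConnectTimeoutException",
            "NoHttpResponseException",
            "SocketException");

    /** Walks the class hierarchy so that subclasses of a listed type also match. */
    public static Outcome classify(Throwable t) {
        for (Class<?> c = t.getClass(); c != null; c = c.getSuperclass()) {
            if (SKIPPABLE.contains(c.getSimpleName())) {
                return Outcome.SKIP_NODE;
            }
        }
        // Anything else, including a read timeout from a hung node,
        // is treated as a failed sync.
        return Outcome.SYNC_FAILED;
    }
}
```

With this sketch, a {{java.net.ConnectException}} (a subclass of {{SocketException}}) leads to the node being skipped, while a {{java.net.SocketTimeoutException}} marks the whole sync as failed.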

We can't just blindly add {{SocketTimeoutException}} to the above list, since 
[~shalin] mentioned that timeouts can sometimes happen for genuine 
reasons too, e.g. a temporary GC pause.
I think the general idea here is that we obey the {{leaderVoteWait}} restriction and 
retry the sync with the other nodes.
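
One possible shape of that idea, as a rough sketch only: retry the sync attempt on {{SocketTimeoutException}} until a deadline derived from {{leaderVoteWait}} has passed. The class {{BoundedSyncRetry}}, the {{Result}} enum, the {{pauseMs}} parameter and the {{Callable}} standing in for one sync attempt are all hypothetical. What should happen once the window expires (for example, skipping the hung node) is left to the caller, and every non-timeout exception is collapsed into a failure for brevity.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class BoundedSyncRetry {
    /** Hypothetical result type for this sketch. */
    public enum Result { SYNCED, FAILED, GAVE_UP }

    /**
     * Runs one sync attempt repeatedly while it times out, but never starts
     * a new attempt after the leaderVoteWait window has elapsed.
     */
    public static Result syncWithRetry(Callable<Boolean> attempt,
                                       long leaderVoteWaitMs,
                                       long pauseMs) {
        final long deadline =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
        while (true) {
            try {
                return attempt.call() ? Result.SYNCED : Result.FAILED;
            } catch (SocketTimeoutException e) {
                // Possibly a transient pause (e.g. GC) on the peer: retry below.
            } catch (Exception e) {
                // Simplification: all other errors end the sync as failed.
                return Result.FAILED;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Result.GAVE_UP; // window exhausted, caller decides what next
            }
            try {
                Thread.sleep(pauseMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return Result.GAVE_UP;
            }
        }
    }
}
```

A peer that recovers from a short pause would yield {{SYNCED}} on a later attempt, while a peer that stays hung yields {{GAVE_UP}} once the window closes, instead of leaving the shard without a leader indefinitely.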




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
