Andrzej Bialecki  created SOLR-11730:
----------------------------------------

             Summary: Test NodeLost / NodeAdded dynamics
                 Key: SOLR-11730
                 URL: https://issues.apache.org/jira/browse/SOLR-11730
             Project: Solr
          Issue Type: Sub-task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Andrzej Bialecki 


Let's consider a "flaky node" scenario.

A node is going up and down at short intervals (e.g. due to a flaky network 
cable). If the frequency of these events coincides with the {{waitFor}} interval 
in the {{nodeLost}} trigger configuration, the node may never be reported to the 
autoscaling framework as lost. Similarly, it may never be reported as added back 
if it's lost again within the {{waitFor}} period of the {{nodeAdded}} trigger.

Other scenarios are possible here too, depending on timing:
* node being constantly reported as lost
* node being constantly reported as added
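
The timing dependence above can be illustrated with a small simulation. This is 
only a sketch under a simplified model, not how Solr's triggers are implemented: 
it assumes a node that alternates between fixed up and down periods, and a pair 
of triggers that fire once the node has stayed in its new state for at least 
{{waitFor}} seconds. The function name {{simulate}} and all of its parameters 
are made up for the illustration.

```python
def simulate(up_s, down_s, wait_for_s, total_s):
    """Return (time, event) pairs fired by a simplified nodeLost/nodeAdded pair.

    Assumed model (hypothetical, not Solr's implementation): the node is up
    for up_s seconds, then down for down_s seconds, repeating. An event fires
    once the node has stayed in a new state for at least wait_for_s seconds.
    """
    events = []
    live = True        # the node starts out live
    since = 0          # time of the most recent state change
    reported = True    # the initial live state needs no event
    for t in range(total_s + 1):
        now_live = (t % (up_s + down_s)) < up_s
        if now_live != live:
            # State flipped: restart the waitFor clock.
            live, since, reported = now_live, t, False
        if not reported and t - since >= wait_for_s:
            events.append((t, "nodeAdded" if live else "nodeLost"))
            reported = True
    return events


# Flapping faster than waitFor: nothing is ever reported.
print(simulate(up_s=20, down_s=20, wait_for_s=30, total_s=600))
# Short up periods, long down periods: only nodeLost is ever reported.
print(simulate(up_s=10, down_s=40, wait_for_s=30, total_s=150))
```

In this model the first call returns no events at all, while the second keeps 
reporting the node as lost without ever reporting it as added.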

One possible solution for the autoscaling triggers is that the framework should 
keep a short-term ({{waitFor * 2}} long?) memory of the node state that the 
trigger is tracking in order to eliminate flaky nodes (i.e. those that 
transitioned between states more than once within the period).
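
A minimal sketch of such a short-term memory follows, assuming a window of 
{{waitFor * 2}} and timestamps in seconds. The class {{NodeStateMemory}} and its 
methods are hypothetical names for the illustration and are not part of the 
autoscaling framework.

```python
from collections import deque


class NodeStateMemory:
    """Remembers recent state transitions of one node (hypothetical helper)."""

    def __init__(self, wait_for_s):
        # Assumption taken from the proposal: remember waitFor * 2 seconds.
        self.window_s = 2 * wait_for_s
        self.transitions = deque()  # timestamps of live <-> lost transitions

    def record_transition(self, t):
        """Record that the node changed state at time t (seconds)."""
        self.transitions.append(t)

    def is_flaky(self, now):
        """True if the node changed state more than once within the window."""
        # Forget transitions that are older than the window.
        while self.transitions and now - self.transitions[0] > self.window_s:
            self.transitions.popleft()
        return len(self.transitions) > 1


memory = NodeStateMemory(wait_for_s=30)
memory.record_transition(0)    # node lost
memory.record_transition(20)   # node back
print(memory.is_flaky(25))     # two transitions within 60 seconds
```

A trigger could consult something like {{is_flaky}} before firing and skip 
nodes that are still changing state.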

A situation like this is detrimental to SolrCloud behavior regardless of 
autoscaling actions, so it should probably be addressed at the node level, e.g. 
by shutting down the Solr node after the number of disconnects in a time window 
reaches a certain threshold.
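
The node-level check could look roughly like the sketch below. The function 
{{should_shut_down}}, its parameters and the example values are all assumptions 
made for the illustration; the issue does not specify a window length or a 
threshold.

```python
def should_shut_down(disconnect_times, now, window_s, max_disconnects):
    """Decide whether a node has disconnected too often (hypothetical helper).

    disconnect_times: timestamps (seconds) of past disconnects, in any order.
    Returns True once the number of disconnects within the last window_s
    seconds reaches max_disconnects.
    """
    recent = sum(1 for t in disconnect_times if now - window_s <= t <= now)
    return recent >= max_disconnects


# Three disconnects in the last 15 seconds with a threshold of three.
print(should_shut_down([5, 50, 55, 58], now=60, window_s=15, max_disconnects=3))
```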



