Andrzej Bialecki  created SOLR-12479:
----------------------------------------

             Summary: TriggerAction failures may cause inconsistent trigger 
behavior
                 Key: SOLR-12479
                 URL: https://issues.apache.org/jira/browse/SOLR-12479
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: AutoScaling
    Affects Versions: 7.4, master (8.0)
            Reporter: Andrzej Bialecki 


The following issue occasionally appears when running 
{{TestLargeCluster.testNodeLost}}.

The test kills a large number of nodes, waiting for a certain time between the 
kills. Depending on the order of the kills and the length of {{waitFor}}, the 
target node may have just been killed by the time {{ExecutePlanAction}} 
processes MOVEREPLICA. This results in an exception and a FAILED status of the 
action.

However, this failure is not reported back to the trigger as an unprocessed 
event because it happens asynchronously in the action executor (in 
{{ScheduledTriggers}}), so the trigger resets its internal state and no longer 
tracks the lost node. As a result, replicas remain lost, and even if there’s a 
Policy violation the event will not be generated again, so the number of 
replicas won’t go back to the original number.
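
The sequence described above can be illustrated with a minimal, self-contained 
sketch. It does not use Solr code: the class name, the {{lostNodes}} set, the 
{{reportFailure}} flag and the simulated exception are all hypothetical 
stand-ins, and re-adding the node on failure is only one possible remedy, not 
necessarily the fix that will be applied here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Solr.
public class TriggerFailureSketch {

    /**
     * Simulates one "node lost" event whose action fails asynchronously.
     * Returns true if the lost node is still tracked once the action has finished.
     */
    static boolean stillTrackedAfterFailure(boolean reportFailure) {
        // Stand-in for the trigger's internal state: the nodes it considers lost.
        Set<String> lostNodes = ConcurrentHashMap.newKeySet();
        lostNodes.add("node1");

        ExecutorService actionExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch triggerReset = new CountDownLatch(1);

        // The event is handed to the executor; the caller does not wait for the result.
        actionExecutor.execute(() -> {
            try {
                // The action runs only after the trigger has already reset its state.
                triggerReset.await();
                // Simulates MOVEREPLICA failing because the target node was just killed.
                throw new IllegalStateException("target node is gone");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IllegalStateException e) {
                if (reportFailure) {
                    // Assumed remedy: hand the node back so the event can fire again.
                    lostNodes.add("node1");
                }
                // Otherwise the failure is swallowed here (in real code: only logged).
            }
        });

        // The trigger treats the event as accepted and stops tracking the node.
        lostNodes.remove("node1");
        triggerReset.countDown();

        actionExecutor.shutdown();
        try {
            actionExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lostNodes.contains("node1");
    }

    public static void main(String[] args) {
        System.out.println("failure not reported, still tracked: "
                + stillTrackedAfterFailure(false));
        System.out.println("failure reported, still tracked: "
                + stillTrackedAfterFailure(true));
    }
}
```

With {{reportFailure}} set to false the node is no longer tracked after the 
failed action, which mirrors the behavior reported here; with it set to true the 
node stays tracked, so a later run could retry.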

Also, {{ScheduledTriggers:311}} and 323 only log the exception but don’t 
fire listeners with FAILED status, which is a bug.



