GabrielBrascher edited a comment on pull request #4111:
URL: https://github.com/apache/cloudstack/pull/4111#issuecomment-849180646


   @PaulAngus just to be clear, I am not touching the `HAState`; the change is
to the `ResourceState`. There are specific reasons for this, which I hope this
answer will clarify.
   
   I am sorry for any confusion caused by the naming that was chosen; I can
change the whole context from declaring a host as `Degraded` to another
option, such as `Problematic`, or any other word (initially it was `Dead`, but
that was changed to a better word).
   
   With that said, I totally agree with you that this implementation adds
complexity to an already convoluted execution flow. But of all the options I
can think of, I believe it is one of the best.
   
   
   I created `ResourceState.Degraded` to keep the current HA state machine as
it is, avoiding additional complexity and potential issues. On top of that, the
proposed `ResourceState.Degraded` state creates a unique case where, instead of
CloudStack HA flagging a problematic host, the admin marks the host as
_Degraded_. This helps with tracking what is happening, and also with bringing
the host back from `ResourceState.Degraded` to `ResourceState.Enabled` or
`ResourceState.Maintenance`.
   
   Additionally, your idea of "forcing" the `HAState` to transition to a state
that would lead to either `Fencing` or `Recovering` the host is quite
interesting, but it has a few issues.
   
   1. the `HAState.Degraded` state does not allow a transition to either
`HAState.Recovering` or `HAState.Fencing`. Here is the transition table for
`HAState.Degraded`:
   
   ```
   +----------+---------------------------------------+------------+
   | State    | Event                                 | Next State |
   +----------+---------------------------------------+------------+
   | Degraded | Event.Disabled                        | Disabled   |
   +----------+---------------------------------------+------------+
   | Degraded | Event.Ineligible                      | Ineligible |
   +----------+---------------------------------------+------------+
   | Degraded | Event.HealthCheckFailed               | Degraded   |
   +----------+---------------------------------------+------------+
   | Degraded | Event.HealthCheckPassed               | Available  |
   +----------+---------------------------------------+------------+
   | Degraded | Event.PeriodicRecheckResourceActivity | Suspect    |
   +----------+---------------------------------------+------------+
   ```
   See: 
https://github.com/apache/cloudstack/blob/master/api/src/main/java/org/apache/cloudstack/ha/HAConfig.java
   
   2. the HA state machine allows a transition to `HAState.Recovering` via
`HAState.Checking` when a threshold of failures is reached, which is not
feasible in the specific situation that this PR is designed to handle:
   
   ```
   +----------+------------------------------------------------+------------+
   | State    | Event                                          | Next State |
   +----------+------------------------------------------------+------------+
   | Checking | Event.Disabled                                 | Disabled   |
   +----------+------------------------------------------------+------------+
   | Checking | Event.Ineligible                               | Ineligible |
   +----------+------------------------------------------------+------------+
   | Checking | Event.TooFewActivityCheckSamples               | Suspect    |
   +----------+------------------------------------------------+------------+
   | Checking | Event.ActivityCheckFailureUnderThresholdRatio  | Degraded   |
   +----------+------------------------------------------------+------------+
   | Checking | Event.ActivityCheckFailureOverThresholdRatio   | Recovering |
   +----------+------------------------------------------------+------------+
   ```
   
   3. the HA state machine was designed with the sole purpose of being an
automated system in which all transitions are based on failure checks;
therefore, adding a manual operation is quite hard and might do more harm than
good.
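
   To make points 1 and 2 concrete, here is a minimal, self-contained sketch
that encodes only the rows quoted in the two tables above and checks which
states are directly reachable. The class `HaTransitionSketch` and its helper
methods are hypothetical names used for illustration; this is not the actual
`HAConfig` code, and it does not cover transitions that are not listed in this
comment.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: mirrors the rows quoted in this comment, not the full HA state machine.
public class HaTransitionSketch {
    enum State { Available, Suspect, Degraded, Checking, Recovering, Fencing, Disabled, Ineligible }

    enum Event {
        Disabled, Ineligible, HealthCheckFailed, HealthCheckPassed,
        PeriodicRecheckResourceActivity, TooFewActivityCheckSamples,
        ActivityCheckFailureUnderThresholdRatio, ActivityCheckFailureOverThresholdRatio
    }

    // state -> (event -> next state)
    static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    static void add(State from, Event event, State to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(event, to);
    }

    static {
        // Rows of the Degraded table.
        add(State.Degraded, Event.Disabled, State.Disabled);
        add(State.Degraded, Event.Ineligible, State.Ineligible);
        add(State.Degraded, Event.HealthCheckFailed, State.Degraded);
        add(State.Degraded, Event.HealthCheckPassed, State.Available);
        add(State.Degraded, Event.PeriodicRecheckResourceActivity, State.Suspect);
        // Rows of the Checking table.
        add(State.Checking, Event.Disabled, State.Disabled);
        add(State.Checking, Event.Ineligible, State.Ineligible);
        add(State.Checking, Event.TooFewActivityCheckSamples, State.Suspect);
        add(State.Checking, Event.ActivityCheckFailureUnderThresholdRatio, State.Degraded);
        add(State.Checking, Event.ActivityCheckFailureOverThresholdRatio, State.Recovering);
    }

    // Next state for (from, event), or empty if the quoted tables have no such row.
    static Optional<State> next(State from, Event event) {
        return Optional.ofNullable(TABLE.getOrDefault(from, Map.of()).get(event));
    }

    // True if some single event listed above moves `from` straight to `target`.
    static boolean canReachDirectly(State from, State target) {
        return TABLE.getOrDefault(from, Map.of()).containsValue(target);
    }

    public static void main(String[] args) {
        System.out.println(canReachDirectly(State.Degraded, State.Recovering)); // false
        System.out.println(canReachDirectly(State.Degraded, State.Fencing));    // false
        System.out.println(next(State.Checking, Event.ActivityCheckFailureOverThresholdRatio)); // Optional[Recovering]
    }
}
```

   In this sketch, no listed event takes `Degraded` straight to `Recovering` or
`Fencing`, and the only listed way into `Recovering` is the over-threshold
activity check event from `Checking`.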

