GabrielBrascher edited a comment on pull request #4111: URL: https://github.com/apache/cloudstack/pull/4111#issuecomment-849180646
@PaulAngus just to be clear, I am not touching the `HAState`, but rather the `ResourceState`, for specific reasons that I hope this answer clarifies. I am sorry for any confusion caused by the naming; I can rename `Degraded` to something else, such as `Problematic` (initially it was `Dead`, but that was changed to a better word).

That said, I totally agree with you that this implementation adds complexity to an already convoluted execution flow. However, of all the options I considered, I believe it is one of the best. I created `ResourceState.Degraded` to keep the current HA state machine as it is, avoiding additional complexity and potential issues. On top of that, the proposed `ResourceState.Degraded` state creates a unique case in which, instead of CloudStack HA flagging a problematic host, the admin marks the host as _Degraded_. This helps with tracking what is happening, and also with moving the host back from `ResourceState.Degraded` to `ResourceState.Enabled` or `ResourceState.Maintenance`.

Additionally, your idea of "forcing" the `HAState` into a state that would lead to either `Fencing` or `Recovering` the host is quite interesting, but it has a few issues.

1. The `HAState.Degraded` state does not allow transitioning to `HAState.Recovering` or to `HAState.Fencing`.
Here is how the `HAState.Degraded` state transition table works:

```
+----------+---------------------------------------+------------+
| State    | Event                                 | Next State |
+----------+---------------------------------------+------------+
| Degraded | Event.Disabled                        | Disabled   |
+----------+---------------------------------------+------------+
| Degraded | Event.Ineligible                      | Ineligible |
+----------+---------------------------------------+------------+
| Degraded | Event.HealthCheckFailed               | Degraded   |
+----------+---------------------------------------+------------+
| Degraded | Event.HealthCheckPassed               | Available  |
+----------+---------------------------------------+------------+
| Degraded | Event.PeriodicRecheckResourceActivity | Suspect    |
+----------+---------------------------------------+------------+
```

See: https://github.com/apache/cloudstack/blob/master/api/src/main/java/org/apache/cloudstack/ha/HAConfig.java

2. The HA state machine allows transitioning to `HAState.Recovering` via `HAState.Checking` when a threshold of failures is reached, which is not feasible in the specific situation that this PR is designed to handle:

```
+----------+------------------------------------------------+------------+
| State    | Event                                          | Next State |
+----------+------------------------------------------------+------------+
| Checking | Event.Disabled                                 | Disabled   |
+----------+------------------------------------------------+------------+
| Checking | Event.Ineligible                               | Ineligible |
+----------+------------------------------------------------+------------+
| Checking | Event.TooFewActivityCheckSamples               | Suspect    |
+----------+------------------------------------------------+------------+
| Checking | Event.ActivityCheckFailureUnderThresholdRatio  | Degraded   |
+----------+------------------------------------------------+------------+
| Checking | Event.ActivityCheckFailureOverThresholdRatio   | Recovering |
+----------+------------------------------------------------+------------+
```

3. The HA state machine was designed with the sole purpose of being an automated system in which all transitions are based on failure checks; therefore, adding a manual operation is quite hard and might do more harm than good.
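To make the two tables above easier to check, here is a minimal, standalone Java sketch that encodes only the rows listed in this comment and looks up the allowed transitions. It is not the code from `HAConfig.java`: the class `HaTransitionSketch` and the helpers `next` and `canReachDirectly` are hypothetical names used for illustration, and the enums are stand-ins that only mirror the state and event names quoted above.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hand-written copy of the two tables in this comment,
// not the actual CloudStack HA state machine.
class HaTransitionSketch {

    // Stand-in for the HA states mentioned in the comment.
    enum State { Disabled, Ineligible, Available, Suspect, Degraded, Checking, Recovering, Fencing }

    // Stand-in for the events that appear in the two tables.
    enum Event {
        Disabled, Ineligible, HealthCheckFailed, HealthCheckPassed,
        PeriodicRecheckResourceActivity, TooFewActivityCheckSamples,
        ActivityCheckFailureUnderThresholdRatio, ActivityCheckFailureOverThresholdRatio
    }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        // Rows of the Degraded table.
        Map<Event, State> degraded = new EnumMap<>(Event.class);
        degraded.put(Event.Disabled, State.Disabled);
        degraded.put(Event.Ineligible, State.Ineligible);
        degraded.put(Event.HealthCheckFailed, State.Degraded);
        degraded.put(Event.HealthCheckPassed, State.Available);
        degraded.put(Event.PeriodicRecheckResourceActivity, State.Suspect);
        TRANSITIONS.put(State.Degraded, degraded);

        // Rows of the Checking table.
        Map<Event, State> checking = new EnumMap<>(Event.class);
        checking.put(Event.Disabled, State.Disabled);
        checking.put(Event.Ineligible, State.Ineligible);
        checking.put(Event.TooFewActivityCheckSamples, State.Suspect);
        checking.put(Event.ActivityCheckFailureUnderThresholdRatio, State.Degraded);
        checking.put(Event.ActivityCheckFailureOverThresholdRatio, State.Recovering);
        TRANSITIONS.put(State.Checking, checking);
    }

    // Returns the next state for (current, event), or empty if the tables list no such row.
    static Optional<State> next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(event));
    }

    // True if some single event in the tables moves 'from' directly to 'to'.
    static boolean canReachDirectly(State from, State to) {
        Map<Event, State> row = TRANSITIONS.get(from);
        return row != null && row.containsValue(to);
    }

    public static void main(String[] args) {
        System.out.println(canReachDirectly(State.Degraded, State.Recovering)); // false
        System.out.println(canReachDirectly(State.Degraded, State.Fencing));    // false
        System.out.println(canReachDirectly(State.Checking, State.Recovering)); // true
        System.out.println(next(State.Checking, Event.ActivityCheckFailureOverThresholdRatio)); // Optional[Recovering]
    }
}
```

Under these assumptions, the lookup shows the same thing as points 1 and 2: no single event takes `Degraded` to `Recovering` or `Fencing`, and the only listed path into `Recovering` goes through `Checking` on a failure-ratio event, which an admin cannot trigger by hand.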
