Aaron Lindsey created GEODE-8200:
------------------------------------
Summary: Rebalance operations stuck in "IN_PROGRESS" state forever
Key: GEODE-8200
URL: https://issues.apache.org/jira/browse/GEODE-8200
Project: Geode
Issue Type: Bug
Components: management
Reporter: Aaron Lindsey
We use the management REST API to call rebalance immediately before stopping a
server to limit the possibility of data loss. In a cluster with 3 locators, 3
servers, and no regions, we noticed that sometimes the rebalance operation
never ends if one of the locators is restarting concurrently with the rebalance
operation.
More specifically, the scenario where we see this issue crop up is during an
automated "rolling restart" operation in a Kubernetes environment which
proceeds as follows:
* At most one locator and one server are restarting at any point in time
* Each locator/server waits until the previous locator/server is fully online
before restarting
* Immediately before stopping a server, a rebalance operation is performed and
the server is not stopped until the rebalance operation is completed
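The per-server step described above can be sketched roughly as follows. This is only an illustration of the gating logic, not the actual automation: `start_rebalance`, `get_status`, and `stop_server` are hypothetical callables standing in for the REST calls and the Kubernetes action, and the timeout is an addition that the workflow described here does not have. It is included only to show where a rebalance that never leaves "IN_PROGRESS" would surface.

```python
import time
from typing import Callable


def rebalance_then_stop(
    start_rebalance: Callable[[], str],   # hypothetical: starts a rebalance, returns its operation id
    get_status: Callable[[str], str],     # hypothetical: returns the operation's statusCode
    stop_server: Callable[[], None],      # hypothetical: stops the server
    timeout_s: float = 300.0,             # assumption: the described workflow has no timeout
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Stop the server only after the rebalance leaves IN_PROGRESS.

    Returns False, without stopping the server, if the operation is still
    IN_PROGRESS when the timeout expires.
    """
    op_id = start_rebalance()
    deadline = clock() + timeout_s
    while get_status(op_id) == "IN_PROGRESS":
        if clock() >= deadline:
            return False  # rebalance appears stuck; leave the server running
        sleep(poll_s)
    # Any status other than IN_PROGRESS is treated as finished in this sketch;
    # real automation would also want to check that the rebalance succeeded.
    stop_server()
    return True
```

Without the timeout branch, the loop reproduces the behaviour reported here: the server is never stopped and the rolling restart never completes.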
The impact of this issue is that the "rolling restart" operation will never
complete, because it cannot proceed with stopping a server until the rebalance
operation is completed. A human is then required to intervene and manually
trigger a rebalance and stop the server. This type of "rolling restart"
operation is triggered fairly often in Kubernetes — any time part of the
configuration of the locators or servers changes.
The following JSON is a sample response from the management REST API that shows
the rebalance operation stuck in "IN_PROGRESS".
{code}
{
  "statusCode": "IN_PROGRESS",
  "links": {
    "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
    "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
  },
  "operationStart": "2020-05-27T22:38:30.619Z",
  "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
  "operation": {
    "simulate": false
  }
}
{code}
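As a rough way to flag this condition from a response like the one above, a client could compare "operationStart" against the current time. The sketch below assumes only the "statusCode" and "operationStart" fields shown in the sample; the `is_stuck` helper and the 10-minute threshold are hypothetical and not part of Geode.

```python
import json
from datetime import datetime, timedelta, timezone


def is_stuck(response_body: str, now: datetime, max_age: timedelta) -> bool:
    """Return True if the operation is IN_PROGRESS and older than max_age."""
    op = json.loads(response_body)
    if op.get("statusCode") != "IN_PROGRESS":
        return False
    # Replace the trailing "Z" with an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    started = datetime.fromisoformat(op["operationStart"].replace("Z", "+00:00"))
    return now - started > max_age


# Trimmed copy of the sample response above.
sample = (
    '{"statusCode": "IN_PROGRESS",'
    ' "operationStart": "2020-05-27T22:38:30.619Z",'
    ' "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7"}'
)

# Hypothetical check made about an hour after the operation started.
now = datetime(2020, 5, 27, 23, 38, 30, tzinfo=timezone.utc)
print(is_stuck(sample, now, timedelta(minutes=10)))  # prints True
```

A suitable threshold depends on how long a real rebalance is expected to take; in the cluster described here there are no regions, so the operation would be expected to finish almost immediately.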