Aaron Lindsey created GEODE-8200:
------------------------------------

             Summary: Rebalance operations stuck in "IN_PROGRESS" state forever
                 Key: GEODE-8200
                 URL: https://issues.apache.org/jira/browse/GEODE-8200
             Project: Geode
          Issue Type: Bug
          Components: management
            Reporter: Aaron Lindsey


We use the management REST API to call rebalance immediately before stopping a 
server to limit the possibility of data loss. In a cluster with 3 locators, 3 
servers, and no regions, we noticed that sometimes the rebalance operation 
never ends if one of the locators is restarting concurrently with the rebalance 
operation.

More specifically, the scenario where we see this issue crop up is during an 
automated "rolling restart" operation in a Kubernetes environment which 
proceeds as follows:
* At most one locator and one server are restarting at any point in time
* Each locator/server waits until the previous locator/server is fully online 
before restarting
* Immediately before stopping a server, a rebalance operation is performed and 
the server is not stopped until the rebalance operation is completed
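
For illustration, the "wait until the rebalance completes" step can be sketched as a polling loop with an upper bound. This is not a fix for the underlying bug and not the code our automation actually runs. The function name, the timeout, the poll interval, and the `fetch_status` callable (assumed to return the `statusCode` string from the management REST API) are all hypothetical. The only status value taken from this report is "IN_PROGRESS"; any other value is treated as "no longer in progress".

```python
import time
from typing import Callable


def wait_for_rebalance(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,        # assumed bound, not a Geode default
    poll_interval_s: float = 5.0,    # assumed interval, not a Geode default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status leaves IN_PROGRESS.

    Returns True if the operation left IN_PROGRESS before the deadline,
    False if it was still IN_PROGRESS when the deadline passed.
    """
    deadline = clock() + timeout_s
    while True:
        # fetch_status is a hypothetical callable that would GET the
        # operation's "self" link and return its "statusCode" field.
        if fetch_status() != "IN_PROGRESS":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Without a bound like `timeout_s`, a loop of this shape blocks forever when the operation never leaves "IN_PROGRESS", which is the hang described in this report.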

The impact of this issue is that the "rolling restart" operation will never 
complete, because it cannot proceed with stopping a server until the rebalance 
operation is completed. A human is then required to intervene and manually 
trigger a rebalance and stop the server. This type of "rolling restart" 
operation is triggered fairly often in Kubernetes: any time part of the 
configuration of the locators or servers changes.

The following JSON is a sample response from the management REST API that shows 
the rebalance operation stuck in "IN_PROGRESS".

{code}
    {
      "statusCode": "IN_PROGRESS",
      "links": {
        "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
        "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
      },
      "operationStart": "2020-05-27T22:38:30.619Z",
      "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
      "operation": {
        "simulate": false
      }
    }
{code}
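
As a rough aid for spotting this condition, the following sketch computes how long an operation has been reported as "IN_PROGRESS" from a response shaped like the sample above. It relies only on the `statusCode` and `operationStart` fields shown there. The function name is hypothetical, and deciding how long is "too long" is left to the caller.

```python
import json
from datetime import datetime, timedelta
from typing import Optional


def time_in_progress(response_body: str, now: datetime) -> Optional[timedelta]:
    """Return how long the operation has been IN_PROGRESS, or None otherwise.

    `now` must be timezone-aware so it can be compared with the UTC
    timestamp in the response.
    """
    operation = json.loads(response_body)
    if operation.get("statusCode") != "IN_PROGRESS":
        return None
    # The sample timestamp ends in "Z"; older Python versions of
    # datetime.fromisoformat do not accept that suffix, so rewrite it
    # as an explicit UTC offset first.
    started = datetime.fromisoformat(
        operation["operationStart"].replace("Z", "+00:00")
    )
    return now - started
```

A caller could compare the returned duration against a threshold of its own choosing and alert a human instead of waiting indefinitely.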



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
