[I] Scale down operation fails and is never requeued if `getReplicasForPod` fails transiently [solr-operator]

via GitHub Tue, 11 Jun 2024 12:34:43 -0700


kos-team opened a new issue, #707:
URL: https://github.com/apache/solr-operator/issues/707

## Bug Report

**What did you do? / Minimal Reproducible Example**
We first deployed a five-replica Solr cluster, then modified the CR to scale
down to three replicas.
The scale down triggers the solr-operator to perform the ManagedScaleDown
operation, invoking the `handleManagedCloudScaleDown` function. Inside the
function, the operator invokes the `evictSinglePod` function. If the call to
the Solr transiently fails (due to transient network issue or transient
unavailability of Solr), the operator writes an error message and never
retries. Even if the transient fault is resolved later, the operator is stuck
and never perform the scale down operation.

To reproduce this bug,

1. apply the CR to deploy a 5 replica Solr cluster by changing replicas to 5
in the example:
https://github.com/apache/solr-operator/blob/main/example/test_solrcloud.yaml
2. Expose faults to the operator (this can be reliably reproduced using the
ChaosMesh controller by injecting a network loss between the Solr pods and the
Solr operator)
3. Change the `replicas: 5` to `replicas:3` in the CR to initiate the scale
down.
4. Observe that the operator reports an error log because it is unable to
contact the Solr Pods for status
```
ERROR Error retrieving cluster status, cannot determine if pod has
replicas {"controller": "solrcloud", "controllerGroup": "solr.apache.org",
"controllerKind": "SolrCloud", "SolrCloud":
{"name":"example","namespace":"default"}, "namespace": "default", "name":
"example", "reconcileID": "66d08e1c-a6cd-4176-a792-c706177cc39f", "error": "Get
\"http://example-solrcloud-common.default/solr/admin/collections?action=CLUSTERSTATUS&wt=json\":
dial tcp 10.96.177.119:80: i/o timeout"}
```
5. Remove the fault and observe that the operator never requeues and
reconciles the request

**What did you expect to see?**
The operator should be able to scale down the cluster after the transient
faults go away

**What did you see instead?**
The operator never requeues the request

**Root Cause**
The root cause of this bug is at
https://github.com/apache/solr-operator/blob/b2c951052828f88e9fc415f54aec189b805117e9/controllers/solr_cluster_ops_util.go#L251C57-L251C71
where the operator does not handle the error returned by `evictSinglePod`
but just directly returns normally.

The fix should be simple to add an error handling branch to requeue the
request.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Scale down operation fails and is never requeued if `getReplicasForPod` fails transiently [solr-operator]

Reply via email to