HoustonPutman commented on PR #596:
URL: https://github.com/apache/solr-operator/pull/596#issuecomment-1668248340
> * Do we need to retry operations? I'm thinking that if we just dropped operations that e.g. timed out or errored out, the reconcile loop will naturally retry, no? Especially in the context of HPA, I bet it re-evaluates whether the current cluster has enough horsepower

So yes, in a lot of cases the operator will notice that the state of the cluster is not in the correct place if an operation is stopped (e.g. if you are doing a scale-down, the cluster won't be scaled down; if you are doing a rolling restart, not all pods will be up-to-date). If these clusterOps get deleted, the Solr Operator will re-create the clusterOp because the SolrCloud is not in the correct state.

However, some operations don't have their state as readily available to the operator. For example, the first step of a scale-up operation is to scale the cluster up to the new number of pods; the operator then calls a `BalanceReplicas` command. If that command fails and the operation is dropped, the operator does not know that the cluster is imbalanced (because the StatefulSet has the correct # of pods), so it doesn't know that it should redo the `BalanceReplicas` command. This is the perfect use case for a "retry" queue.

Also, if we add more cluster operations in the future, a queue lets us know which operations we have tried and are waiting to do later. Example: we have cluster operations A, B and C, all of which need to take place, with the ordering preference A -> B -> C. Both A & B are failing, and need C to occur first before they can succeed.

- Without a queue of what is waiting to run, we will flip between A & B forever: A fails, so we skip it and start B. B fails, so we pick the next cluster operation, which is A. We have no clue that A was failing before, so we have no reason not to run it next. C never gets a chance to run.
- With the queue, A fails and is added to the queue. We then go through the necessary operations, skip A because it is queued, and choose B. When B fails, it is added to the queue too. The operator then skips A & B because they are queued, so C is chosen and succeeds. Now A will be retried, and when it succeeds, B will be retried. (See the sketch below.)

> * Along the same lines, is there a priority that's linked to different kinds of requests? For example, if we had a scaleUp operation and a scaleDown operation came about, I would assume that whoever came last should win...

Yes, absolutely! If the `scaleUp` happens first and we have already started the operation, we at least need to wait until it's in a stoppable state (which the retryQueue already does). Once it can be stopped, we go through and see if any other operations need to take place while it is stopped. The operator will see that a `scaleDown` needs to occur, and since `scale` operations override each other, the queued `scaleUp` will be **replaced** by the new `scaleDown` operation. So the `scaleDown` will ultimately win; it just has to wait, because we need to make sure that data isn't in transit while we switch from the `scaleUp` to the `scaleDown`.
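To make the queue behavior concrete, here is a minimal, self-contained Go sketch of the two ideas above: skipping operations that are already queued for retry (which is what lets C run while A & B wait), and replacing a queued scale op with a newer one. This is not the Solr Operator's actual code; the names (`RetryQueue`, `ClusterOp`, `NextOp`) and the exact replacement rule are assumptions made for illustration.

```go
package main

import "fmt"

// OpKind identifies a type of cluster operation. These names are
// illustrative, not the operator's actual clusterOp identifiers.
type OpKind string

const (
	ScaleUp   OpKind = "ScaleUp"
	ScaleDown OpKind = "ScaleDown"
)

// ClusterOp is a stand-in for a queued cluster operation.
type ClusterOp struct {
	Kind OpKind
}

// isScaleOp marks the operations that override each other: a newer
// scale op should replace a queued one rather than wait behind it.
func isScaleOp(k OpKind) bool {
	return k == ScaleUp || k == ScaleDown
}

// RetryQueue remembers operations that failed and are waiting to retry.
type RetryQueue struct {
	ops []ClusterOp
}

// Contains reports whether an operation of this kind is already queued,
// which lets the reconcile loop skip it and try the next needed op.
func (q *RetryQueue) Contains(k OpKind) bool {
	for _, op := range q.ops {
		if op.Kind == k {
			return true
		}
	}
	return false
}

// Enqueue adds a failed operation for later retry. A new scale op
// replaces any queued scale op, so the most recent scale request wins.
func (q *RetryQueue) Enqueue(op ClusterOp) {
	if isScaleOp(op.Kind) {
		for i, queued := range q.ops {
			if isScaleOp(queued.Kind) {
				q.ops[i] = op
				return
			}
		}
	}
	q.ops = append(q.ops, op)
}

// NextOp picks the next operation the cluster needs, skipping kinds
// already queued for retry. This is what lets C run while A and B wait.
func NextOp(needed []ClusterOp, q *RetryQueue) (ClusterOp, bool) {
	for _, op := range needed {
		if !q.Contains(op.Kind) {
			return op, true
		}
	}
	return ClusterOp{}, false
}

func main() {
	q := &RetryQueue{}
	needed := []ClusterOp{{Kind: "A"}, {Kind: "B"}, {Kind: "C"}}

	q.Enqueue(needed[0]) // A failed, queue it
	q.Enqueue(needed[1]) // B failed, queue it
	if op, ok := NextOp(needed, q); ok {
		fmt.Println("next to run:", op.Kind) // prints C
	}

	// A queued scaleUp is replaced when a scaleDown comes along.
	q.Enqueue(ClusterOp{Kind: ScaleUp})
	q.Enqueue(ClusterOp{Kind: ScaleDown})
	fmt.Println(q.ops) // [{A} {B} {ScaleDown}]
}
```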
> ... And that maybe a rolling upgrade (if started) should ideally complete before we start scaling, otherwise we might get in trouble

Once again, yes! The rolling upgrade comes first in the list of things to take care of. So if, at the exact same time, a user increases the podCount of the SolrCloud and changes something about the podTemplate, then the rolling restart will take precedence over the scaleUp. However, a rolling restart can take a while, so the "timeout" might happen, which would give the scaleUp a chance to start in the middle.

Maybe we only actually "queue" the operation for later if the Solr Operator encounters an error; if it sees no error, then it won't queue the operation. Or, even better: have two timeouts (sketched below):

- 1 minute for an operation that sees an error
- 10 minutes for an operation that does not see an error

Once again these are "soft" timeouts, so queued operations will eventually be retried. But the 10 minutes gives us a better guarantee of order-of-operations, as you mentioned, for operations that are running without issue.

I'll look into doing this ^ now. Good call out.
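For illustration, a minimal Go sketch of the two-tier soft timeout idea. The names (`QueuedOp`, `ReadyToRetry`) and field layout are hypothetical, not part of the operator; only the 1-minute/10-minute split comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

// QueuedOp is a hypothetical queued operation; field names are illustrative.
type QueuedOp struct {
	Kind     string
	LastRun  time.Time
	SawError bool // whether the last attempt hit an error
}

// The two soft timeouts described above: errored operations retry
// quickly, while healthy-but-paused operations wait longer so an
// in-progress operation (e.g. a rolling restart) keeps its ordering.
const (
	erroredRetryAfter = 1 * time.Minute
	pausedRetryAfter  = 10 * time.Minute
)

// ReadyToRetry reports whether an op's soft timeout has elapsed.
// These are "soft" timeouts: the op is always retried eventually;
// the delay only biases the order of operations.
func ReadyToRetry(op QueuedOp, now time.Time) bool {
	wait := pausedRetryAfter
	if op.SawError {
		wait = erroredRetryAfter
	}
	return now.Sub(op.LastRun) >= wait
}

func main() {
	now := time.Now()
	scaleUp := QueuedOp{Kind: "ScaleUp", LastRun: now.Add(-2 * time.Minute), SawError: true}
	restart := QueuedOp{Kind: "RollingRestart", LastRun: now.Add(-2 * time.Minute), SawError: false}

	fmt.Println(ReadyToRetry(scaleUp, now)) // true: errored, 1m timeout has passed
	fmt.Println(ReadyToRetry(restart, now)) // false: healthy, 10m not yet reached
}
```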
