HoustonPutman commented on PR #596:
URL: https://github.com/apache/solr-operator/pull/596#issuecomment-1668248340
> * Do we need to retry operations? I'm thinking that if we just dropped operations that e.g. timed out or errored out, the reconcile loop will naturally retry, no? Especially in the context of HPA, I bet it re-evaluates whether the current cluster has enough horsepower

So yes, in a lot of cases the operator will notice that the state of the cluster is not in the correct place if an operation is stopped (e.g. if you are doing a scale-down, the cluster won't be scaled down; if you are doing a rolling restart, not all pods will be up-to-date). If these clusterOps get deleted, the Solr Operator will re-create the clusterOp because the SolrCloud is not in the correct state.

However, some operations don't have their state as readily available to the operator. For example, the first step of a scale-up operation is to scale the cluster up to the new number of pods; the operator then calls a `BalanceReplicas` command. If that command fails and the operation is dropped, the operator does not know that the cluster is imbalanced (because the StatefulSet has the correct # of pods), so it doesn't know that it should redo the `BalanceReplicas` command. This is the perfect use case for a "retry" queue.

Also, if we add more cluster operations in the future, a queue lets us know which operations we have tried and are waiting to do later. Example: we have cluster operations A, B and C, all of which need to take place, with the ordering preference A -> B -> C. Both A & B are failing, and need C to occur first before they can succeed.

- Without a queue of what is waiting to run, we will flip between A & B forever: A fails, so we skip it and start B. B fails, so we pick the next cluster operation, which is A. We have no clue that A was failing before, so we have no reason not to run it next. C never gets a chance to run.
- With the queue, A fails and is added to the queue. We then go through the necessary operations, skip A because it is queued, and choose B. When B fails, it is added to the queue too. The operator then skips A & B because they are queued, so C is chosen and succeeds. Now A will be retried, and when it succeeds, B will be retried. (See the sketch below.)

> * Along the same lines, is there a priority that's linked to different kinds of requests? For example, if we had a scaleUp operation and a scaleDown operation came about, I would assume that whoever came last should win...

Yes, absolutely! If the `scaleUp` happens first and we have already started the operation, we at least need to wait until it's in a stoppable state (which the retryQueue already does). Once it can be stopped, we go through and see if any other operations need to take place while it is stopped. The operator will see that a `scaleDown` needs to occur, and since `scale` operations override each other, the queued `scaleUp` will be **replaced** by the new `scaleDown` operation. So the `scaleDown` will ultimately win; it just has to wait, because we need to make sure that data isn't in transit while we switch from the `scaleUp` to the `scaleDown`.
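To make the queue behavior concrete, here is a minimal, self-contained Go sketch of the two ideas above: skipping operations that are already queued for retry (which is what lets C run while A & B wait), and replacing a queued scale op with a newer one. This is not the Solr Operator's actual code; the names (`RetryQueue`, `ClusterOp`, `NextOp`) and the exact replacement rule are assumptions made for illustration.

```go
package main

import "fmt"

// OpKind identifies a type of cluster operation. These names are
// illustrative, not the operator's actual clusterOp identifiers.
type OpKind string

const (
	ScaleUp   OpKind = "ScaleUp"
	ScaleDown OpKind = "ScaleDown"
)

// ClusterOp is a stand-in for a queued cluster operation.
type ClusterOp struct {
	Kind OpKind
}

// isScaleOp marks the operations that override each other: a newer
// scale op should replace a queued one rather than wait behind it.
func isScaleOp(k OpKind) bool {
	return k == ScaleUp || k == ScaleDown
}

// RetryQueue remembers operations that failed and are waiting to retry.
type RetryQueue struct {
	ops []ClusterOp
}

// Contains reports whether an operation of this kind is already queued,
// which lets the reconcile loop skip it and try the next needed op.
func (q *RetryQueue) Contains(k OpKind) bool {
	for _, op := range q.ops {
		if op.Kind == k {
			return true
		}
	}
	return false
}

// Enqueue adds a failed operation for later retry. A new scale op
// replaces any queued scale op, so the most recent scale request wins.
func (q *RetryQueue) Enqueue(op ClusterOp) {
	if isScaleOp(op.Kind) {
		for i, queued := range q.ops {
			if isScaleOp(queued.Kind) {
				q.ops[i] = op
				return
			}
		}
	}
	q.ops = append(q.ops, op)
}

// NextOp picks the next operation the cluster needs, skipping kinds
// already queued for retry. This is what lets C run while A and B wait.
func NextOp(needed []ClusterOp, q *RetryQueue) (ClusterOp, bool) {
	for _, op := range needed {
		if !q.Contains(op.Kind) {
			return op, true
		}
	}
	return ClusterOp{}, false
}

func main() {
	q := &RetryQueue{}
	needed := []ClusterOp{{Kind: "A"}, {Kind: "B"}, {Kind: "C"}}

	q.Enqueue(needed[0]) // A failed, queue it
	q.Enqueue(needed[1]) // B failed, queue it
	if op, ok := NextOp(needed, q); ok {
		fmt.Println("next to run:", op.Kind) // prints C
	}

	// A queued scaleUp is replaced when a scaleDown comes along.
	q.Enqueue(ClusterOp{Kind: ScaleUp})
	q.Enqueue(ClusterOp{Kind: ScaleDown})
	fmt.Println(q.ops) // [{A} {B} {ScaleDown}]
}
```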
> ... And that maybe a rolling upgrade (if started) should ideally complete before we start scaling, otherwise we might get in trouble

Once again, yes! The rolling upgrade comes first in the list of things to take care of. So if, at the exact same time, a user increases the podCount of the SolrCloud and changes something about the podTemplate, then the rolling restart will take precedence over the scaleUp. However, a rolling restart can take a while, so the "timeout" might happen, which would give the scaleUp a chance to start in the middle.

Maybe we only actually "queue" the operation for later if the Solr Operator encounters an error; if it sees no error, then it won't queue the operation. Or, even better: have two timeouts (sketched below):

- 1 minute for an operation that sees an error
- 10 minutes for an operation that does not see an error

Once again these are "soft" timeouts, so queued operations will eventually be retried. But the 10 minutes gives us a better guarantee of order-of-operations, as you mentioned, for operations that are running without issue.

I'll look into doing this ^ now. Good call out.
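For illustration, a minimal Go sketch of the two-tier soft timeout idea. The names (`QueuedOp`, `ReadyToRetry`) and field layout are hypothetical, not part of the operator; only the 1-minute/10-minute split comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

// QueuedOp is a hypothetical queued operation; field names are illustrative.
type QueuedOp struct {
	Kind     string
	LastRun  time.Time
	SawError bool // whether the last attempt hit an error
}

// The two soft timeouts described above: errored operations retry
// quickly, while healthy-but-paused operations wait longer so an
// in-progress operation (e.g. a rolling restart) keeps its ordering.
const (
	erroredRetryAfter = 1 * time.Minute
	pausedRetryAfter  = 10 * time.Minute
)

// ReadyToRetry reports whether an op's soft timeout has elapsed.
// These are "soft" timeouts: the op is always retried eventually;
// the delay only biases the order of operations.
func ReadyToRetry(op QueuedOp, now time.Time) bool {
	wait := pausedRetryAfter
	if op.SawError {
		wait = erroredRetryAfter
	}
	return now.Sub(op.LastRun) >= wait
}

func main() {
	now := time.Now()
	scaleUp := QueuedOp{Kind: "ScaleUp", LastRun: now.Add(-2 * time.Minute), SawError: true}
	restart := QueuedOp{Kind: "RollingRestart", LastRun: now.Add(-2 * time.Minute), SawError: false}

	fmt.Println(ReadyToRetry(scaleUp, now)) // true: errored, 1m timeout has passed
	fmt.Println(ReadyToRetry(restart, now)) // false: healthy, 10m not yet reached
}
```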
