HoustonPutman commented on code in PR #561:
URL: https://github.com/apache/solr-operator/pull/561#discussion_r1196543849
##########
docs/solr-cloud/autoscaling.md:
##########
@@ -0,0 +1,85 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements. See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+ -->
+
+# SolrCloud Scaling
+_Since v0.8.0_
+
+Solr Clouds are complex distributed systems, and thus require additional help when trying to scale up or down.
+
+Scaling/Autoscaling can mean different things in different situations, and this is true even within the `SolrCloud.spec.autoscaling` section.
+- Replicas can be moved when new nodes are added or when nodes need to be taken down
+- Nodes can be added/removed if more or less resources are desired.
+
+The following sections describes all the features that the Solr Operator currently supports to aid in scaling & autoscaling SolrClouds.
+
+## Configuration
+
+The `autoscaling` section in the SolrCloud CRD can be configured in the following ways
+
+```yaml
+spec:
+  autoscaling:
+    vacatePodsOnScaleDown: true # Default: true
+```
+
+## Replica Movement
+
+Solr can be scaled up & down either manually or by `HorizontalPodAutoscaler`'s, however no matter how the `SolrCloud.Spec.Replicas` value
+changes, the Solr Operator must implement this change the same way.
+
+For now Replicas are not scaled up and down themselves, they are just moved to utilize new Solr pods or vacate soon-to-be-deleted Solr pods.
+
+### Solr Pod Scale-Down
+
+When the desired number of Solr Pods that should be run `SolrCloud.Spec.Replicas` is decreased,
+the `SolrCloud.spec.autoscaling.vacatePodsOnScaleDown` option determines whether the Solr Operator should move replicas
+off of the pods that are about to be deleted.
+
+When a StatefulSet, which the Solr Operator uses to run Solr pods, has its size decreased by `x` pods, it's the last
+`x` pods that are deleted. So if a StatefulSet `tmp` has size 4, it will have pods `tmp-0`, `tmp-1`, `tmp-2` and `tmp-3`.
+If that `tmp` then is scaled down to size 2, then pods `tmp-2` and `tmp-3` will be deleted because they are `tmp`'s last pods numerically.
+
+If Solr has replicas placed on the pods that will be deleted as a part of the scale-down, then it has a problem.
+Solr will expect that these replicas will eventually come back online, because they are a part of the clusterState.
+However, the Solr Operator has no expectations for these replicas to come back, because the cloud has been scaled down.
+Therefore, the safest option is to move the replicas off of these pods before the scale-down operation occurs.
+
+If `autoscaling.vacatePodsOnScaleDown` option is not enabled, then whenever the `SolrCloud.Spec.Replicas` is decreased,
+that change will be reflected in the StatefulSet immediately.
+Pods will be deleted even if replicas live on those pods.
+
+If `autoscaling.vacatePodsOnScaleDown` option is enabled, which it is by default, then the following steps occur:
+1. Acquire a cluster-ops lock on the SolrCloud. (This means other cluster operations, such as a rolling restart, cannot occur during the scale down operation)
+1. Scale down the last pod.
+   1. Mark the pod as "notReady" so that traffic is diverted away from this pod.

Review Comment:
   This isn't actually a change, it's how any operator-managed pod-stop happens since the last release. See https://github.com/apache/solr-operator/pull/530.

   But to actually answer your question, it does not affect traffic going to the headless service or individual node service. So traffic going directly to the pod will not be stopped ever. The only service that actually uses the readiness conditions is the common service (ultimately used by the common ingress endpoint). I can make the documentation clearer here.
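To illustrate the distinction being made here, below is a minimal sketch of the two kinds of Services involved; the names and labels are illustrative, not copied from the operator's actual manifests. A Service that sets `publishNotReadyAddresses: true` (as headless or per-node discovery Services typically do) keeps resolving a pod even after a readiness condition is set to false, while a plain ClusterIP Service, like the common service behind the ingress, only keeps endpoints for pods that are Ready.

```yaml
# Illustrative only -- hypothetical names/labels, not the operator's manifests.
# A headless Service that publishes not-ready addresses keeps resolving a pod
# even after its readiness condition goes false, so direct/node traffic is unaffected.
apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-headless   # hypothetical name
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    solr-cloud: example
  ports:
    - port: 8983
---
# A regular ClusterIP Service (like the common service behind the ingress)
# only keeps endpoints for Ready pods, so marking the pod "notReady"
# diverts this traffic away from it.
apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-common     # hypothetical name
spec:
  selector:
    solr-cloud: example
  ports:
    - port: 80
      targetPort: 8983
```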
##########
controllers/solr_cluster_ops_util.go:
##########
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package controllers
+
+import (
+    "context"
+    "errors"
+    solrv1beta1 "github.com/apache/solr-operator/api/v1beta1"
+    "github.com/apache/solr-operator/controllers/util"
+    "github.com/go-logr/logr"
+    appsv1 "k8s.io/api/apps/v1"
+    corev1 "k8s.io/api/core/v1"
+    "k8s.io/utils/pointer"
+    "sigs.k8s.io/controller-runtime/pkg/client"
+    "strconv"
+    "time"
+)
+
+func determineScaleClusterOpLockIfNecessary(ctx context.Context, r *SolrCloudReconciler, instance *solrv1beta1.SolrCloud, statefulSet *appsv1.StatefulSet, podList []corev1.Pod, logger logr.Logger) (clusterOpLock string, clusterOpMetadata string, retryLaterDuration time.Duration, err error) {
+    desiredPods := int(*instance.Spec.Replicas)
+    configuredPods := int(*statefulSet.Spec.Replicas)
+    if desiredPods != configuredPods {
+        scaleTo := -1
+        // Start a scaling operation
+        if desiredPods < configuredPods {
+            // Scale down!
+            // The option is enabled by default, so treat "nil" like "true"
+            if instance.Spec.Autoscaling.VacatePodsOnScaleDown == nil || *instance.Spec.Autoscaling.VacatePodsOnScaleDown {
+                if desiredPods > 0 {
+                    // We only support scaling down one pod at-a-time if not scaling down to 0 pods
+                    scaleTo = configuredPods - 1
+                } else {
+                    scaleTo = 0
+                }
+            } else {
+                // The cloud is not setup to use managed scale-down
+                err = scaleCloudUnmanaged(ctx, r, statefulSet, desiredPods, logger)
+            }
+        } else if desiredPods > configuredPods {
+            // Scale up!
+            // TODO: replicasScaleUp is not supported, so do not make a clusterOp out of it, just do the patch
+            err = scaleCloudUnmanaged(ctx, r, statefulSet, desiredPods, logger)
+        }
+        if scaleTo > -1 {
+            clusterOpLock = util.ScaleLock
+            clusterOpMetadata = strconv.Itoa(scaleTo)
+        }
+    }
+    return
+}
+
+func handleLockedClusterOpScale(ctx context.Context, r *SolrCloudReconciler, instance *solrv1beta1.SolrCloud, statefulSet *appsv1.StatefulSet, podList []corev1.Pod, logger logr.Logger) (retryLaterDuration time.Duration, err error) {
+    if scalingToNodes, hasAnn := statefulSet.Annotations[util.ClusterOpsMetadataAnnotation]; hasAnn {
+        if scalingToNodesInt, convErr := strconv.Atoi(scalingToNodes); convErr != nil {
+            logger.Error(convErr, "Could not convert statefulSet annotation to int for scale-down-to information", "annotation", util.ClusterOpsMetadataAnnotation, "value", scalingToNodes)
+            err = convErr
+        } else {
+            replicaManagementComplete := false
+            if scalingToNodesInt < int(*statefulSet.Spec.Replicas) {
+                // Manage scaling down the SolrCloud
+                replicaManagementComplete, err = handleManagedCloudScaleDown(ctx, r, instance, statefulSet, scalingToNodesInt, podList, logger)
+                // } else if scalingToNodesInt > int(*statefulSet.Spec.Replicas) {
+                // TODO: Utilize the scaled-up nodes in the future, however Solr does not currently have APIs for this.
+                // TODO: Think about the order of scale-up and restart when individual nodeService IPs are injected into the pods.
+                // TODO: Will likely want to do a scale-up of the service first, then do the rolling restart of the cluster, then utilize the node.
+            } else {
+                // This shouldn't happen. The ScalingToNodesAnnotation is removed when the statefulSet size changes, through a Patch.
+                // But if it does happen, we should just remove the annotation and move forward.
+                patchedStatefulSet := statefulSet.DeepCopy()
+                delete(patchedStatefulSet.Annotations, util.ClusterOpsLockAnnotation)
+                delete(patchedStatefulSet.Annotations, util.ClusterOpsMetadataAnnotation)
+                if err = r.Patch(ctx, patchedStatefulSet, client.StrategicMergeFrom(statefulSet)); err != nil {
+                    logger.Error(err, "Error while patching StatefulSet to remove unneeded clusterLockOp annotation for scaling to the current amount of nodes")
+                } else {
+                    statefulSet = patchedStatefulSet
+                }
+            }
+
+            // Scale down the statefulSet to represent the new number of utilizedPods, if it is lower than the current number of pods
+            // Also remove the "scalingToNodes" annotation, as that acts as a lock on the cluster, so that other operations,
+            // such as scale-up, pod updates and further scale-down cannot happen at the same time.
+            if replicaManagementComplete {
+                patchedStatefulSet := statefulSet.DeepCopy()
+                patchedStatefulSet.Spec.Replicas = pointer.Int32(int32(scalingToNodesInt))
+                delete(patchedStatefulSet.Annotations, util.ClusterOpsLockAnnotation)
+                delete(patchedStatefulSet.Annotations, util.ClusterOpsMetadataAnnotation)
+                if err = r.Patch(ctx, patchedStatefulSet, client.StrategicMergeFrom(statefulSet)); err != nil {
+                    logger.Error(err, "Error while patching StatefulSet to scale down SolrCloud", "newUtilizedNodes", scalingToNodesInt)
+                }
+
+                // TODO: Create event for the CRD.
+            } else {
+                // Retry after five seconds to check if the replica management commands have been completed
+                retryLaterDuration = time.Second * 5
+            }
+        }
+        // If everything succeeded, the statefulSet will have an annotation updated
+        // and the reconcile loop will be called again.
+
+        return
+    } else {
+        err = errors.New("no clusterOpMetadata annotation is present in the statefulSet")
+        logger.Error(err, "Cannot perform scaling operation when no scale-to-nodes is provided via the clusterOpMetadata")
+        return time.Second * 10, err
+    }
+}
+
+// handleManagedCloudScaleDown does the logic of a managed and "locked" cloud scale down operation.
+// This will likely take many reconcile loops to complete, as it is moving replicas away from the nodes that will be scaled down.
+func handleManagedCloudScaleDown(ctx context.Context, r *SolrCloudReconciler, instance *solrv1beta1.SolrCloud, statefulSet *appsv1.StatefulSet, scaleDownTo int, podList []corev1.Pod, logger logr.Logger) (replicaManagementComplete bool, err error) {
+    // Before doing anything to the pod, make sure that users cannot send requests to the pod anymore.
+    podStoppedReadinessConditions := map[corev1.PodConditionType]podReadinessConditionChange{
+        util.SolrIsNotStoppedReadinessCondition: {
+            reason:  ScaleDown,
+            message: "Pod is being deleted, traffic to the pod must be stopped",
+            status:  false,
+        },
+    }
+
+    if scaleDownTo == 0 {
+        // Delete all collections & data, the user wants no data left if scaling the solrcloud down to 0
+        // This is a much different operation to deleting the SolrCloud/StatefulSet all-together
+        replicaManagementComplete, err = evictAllPods(ctx, r, instance, podList, podStoppedReadinessConditions, logger)
+    } else {
+        // Only evict the last pod, even if we are trying to scale down multiple pods.
+        // Scale down will happen one pod at a time.
+        replicaManagementComplete, err = evictSinglePod(ctx, r, instance, scaleDownTo, podList, podStoppedReadinessConditions, logger)
+    }
+    // TODO: It would be great to support a multi-node scale down when Solr supports evicting many SolrNodes at once.

Review Comment:
   Ahhh yeah, this is more because right now, even if someone scales down the SolrCloud by 2, say 6 -> 4, the Solr Operator has to do it one node at a time. So it will first scale down from 6 -> 5, then 5 -> 4.

   I agree that the HPA is rarely going to scale down that aggressively, but it'll just be a small improvement in the existing Solr API that we can then use to make the managed multi-node scale down operations much faster. I agree it's a small thing.
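To illustrate the scale-down path being discussed, here is a minimal sketch of an autoscaler that would drive `SolrCloud.Spec.Replicas`, assuming the SolrCloud CRD exposes a scale subresource over `spec.replicas`; the name and thresholds are illustrative.

```yaml
# Illustrative only: an HPA driving SolrCloud.Spec.Replicas (hypothetical names).
# Whether a scale-down vacates replicas first is still governed by
# spec.autoscaling.vacatePodsOnScaleDown on the SolrCloud itself.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-solrcloud-hpa   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    name: example
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Even if such an autoscaler drops the target by more than one pod at once (say 6 -> 4), the operator currently walks the StatefulSet down one pod at a time, which is the limitation discussed above.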
