HoustonPutman commented on PR #511: URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1402784282
Thanks for both of your thoughts!

As for the question of whether the liveness and readiness endpoints should be the same: I do not think they should be, eventually. I like `admin/info/system` as the handler for liveness (the same as it is now), since it basically just responds if Solr is running. Once 8.0 is our minimum version (soon), using `admin/info/health` would be great for the readiness probe, since we want to make sure that Solr can connect to ZooKeeper. Eventually adding a parameter that checks that most replicas on the host are healthy could be useful, but I think the ZK connection is a good place to start.

> Can this be overloaded to the point where it's still valid/running, but needs to process requests in flight before it can handle more? If so, we need to make sure the liveness probe won't fail in that scenario, but the readiness check would, so that it stops receiving new traffic, but is permitted to finish the in-flight requests.

I think this is hard to do, since a lot of the request handling could be updates and queries for specific collections, which we can't know about... But I definitely agree it would be great to get to this in the end.

> I would suggest (opinions, definitely up for debate)

(The actual bulk of the changes happening in this PR)

**Yeah, I've upped the number of checks in the startup probe to 10, giving the pod 1 minute to become healthy. I think that should be enough for the Solr server to start.**

**For the others I agree. I have the liveness probe set as 3 checks at 20-second intervals, giving us 40s-1m of "downtime" before taking down the pod. Readiness is set as 2 checks at 10-second intervals, so if ZK isn't available for 10-20 seconds, requests won't be routed to that node. But if it's a blip, one good readiness check and it's back in the list.**

> To be candid, I'm not sure if the operator configures communication between nodes via services or directly with pod names.
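Spelled out as Kubernetes probe definitions on the Solr container, the split described above might look like the following sketch. The endpoint paths, port, and the startup probe's period are assumptions for illustration; only the check counts and intervals come from the discussion:

```yaml
# Liveness: admin/info/system responds whenever the Solr process is up.
livenessProbe:
  httpGet:
    path: /solr/admin/info/system   # path/port assumed; default Solr port
    port: 8983
  periodSeconds: 20
  failureThreshold: 3       # 3 failed 20s checks => 40s-1m before a restart
# Readiness: admin/info/health (Solr 8+) also verifies the ZK connection.
readinessProbe:
  httpGet:
    path: /solr/admin/info/health
    port: 8983
  periodSeconds: 10
  failureThreshold: 2       # 10-20s of ZK trouble pulls the node from routing
  successThreshold: 1       # one good check and it's back in the list
# Startup: 10 checks, giving the pod roughly 1 minute to become healthy.
startupProbe:
  httpGet:
    path: /solr/admin/info/system
    port: 8983
  periodSeconds: 6          # period assumed so that 10 checks ≈ 1 minute
  failureThreshold: 10
```

The key design point is that the liveness check stays deliberately cheap and local (process up), while readiness carries the stricter dependency check (ZK reachable), so a ZK blip de-routes traffic without triggering pod restarts.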
> If configured with services then readiness probes could impact communication between nodes in the SolrCloud.

So for node-level endpoints (the headless service and the individual pod services used for ingresses), the readiness check is not used for routing, since we use the `publishNotReadyAddresses: true` option for these services. The only service that doesn't use this option is the solrcloud-common service, which is what the readiness probe would be impacting. Also, the Solr Operator's rolling-restart logic only uses the readiness probe when calculating the maxPodsDown option, so a readiness probe that is more likely to return errors will slow down rolling restarts, but probably not to a large degree.

> One more consideration we commonly see in the wild is negative feedback loops for liveness probes.

Yeah, this is definitely not something we want to take lightly. We only want to restart Solr nodes when absolutely necessary.
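For reference, the `publishNotReadyAddresses` behavior mentioned above is set on the Service spec. A minimal sketch of such a headless service follows; the service name, selector labels, and port are hypothetical, not the operator's actual generated values:

```yaml
# Headless service for node-to-node traffic: readiness is deliberately
# ignored here, so SolrCloud nodes can always reach each other.
apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-headless   # illustrative name
spec:
  clusterIP: None                    # headless: DNS resolves to pod IPs
  publishNotReadyAddresses: true     # include not-ready pods in endpoints/DNS
  selector:
    solr-cloud: example              # hypothetical selector label
  ports:
    - name: solr-client
      port: 8983
```

With this option set, a failing readiness probe never cuts a node off from its peers; only the common (client-facing) service, which omits the option, stops routing to it.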
