joshsouza commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1379371469
At the moment we're not tuning the liveness/readiness/startup probes from
the defaults the operator provides.
That said, here are my thoughts:
1. Liveness and Readiness are two different concepts, and the general advice
I've seen settled on is that they should rarely be the same check.
- Liveness should ensure that the application is running (hasn't errored
in such a manner that pid1 in the container is up, but the app is not running).
When this check fails, the container should be killed and restarted.
- Startup should check the same thing liveness does, but account for the
worst-case startup delay (the app takes 5 minutes to boot etc...). A Startup
probe is preferable to an initial delay on the liveness check (the liveness
check won't begin until the startup probe has succeeded), and generally isn't
needed unless the app takes time to boot up.
- Readiness should ensure that the application is prepared to handle
incoming requests and is a good target for the load balancer/service to send
traffic to (usually a live endpoint; in this case the metrics one makes a lot
of sense to me).
2. I'm not sure if using the same endpoint for both liveness and readiness
is appropriate for this particular application, but the questions that come to
mind are: Can this be overloaded to the point where it's still valid/running,
but needs to process requests in flight before it can handle more? If so, we
need to make sure the liveness probe _won't_ fail in that scenario, but the
readiness check _would_, so that it stops receiving new traffic, but is
permitted to finish the in-flight requests.
3. Given what I can infer here (that there's only the one endpoint, and it
could be overloaded, but we wouldn't want it to remain overloaded for an
extended period of time), I would suggest (opinions, definitely up for debate):
- Increase the failure threshold for the startup probe from 5 to 15
(right now the pod has 10s to start, or it is considered a failure. I would
bump that to 30s, but that's just me)
- Increase the failure threshold on the liveness probe from 3 to 6 (with
a period of 10, this means that if the pod can't handle any requests for a
minute, it's probably best to kill it off)
- Reduce the period on the readiness probe from 10s to 5s (so if it
can't respond for 15s, it gets dropped from the service. With the 60s liveness
window above, that gives it 45s to sort out requests already in flight before
being considered dead)
- Or, ideally, separate the readiness and liveness checks to match their
purpose more closely (checking the pid is running for liveness etc...)
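
To make the arithmetic behind those numbers easier to check, here is a rough
sketch that multiplies period by failure threshold for each probe. The `Probe`
class and `drain_window` helper are hypothetical names for illustration only.
The 2s startup period and the readiness failure threshold of 3 are inferred
from the "10s to start" and "15s" figures above, not read from the operator's
defaults, and initial delays and timeouts are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Timing fields of a probe, in seconds (illustrative subset only)."""
    period_seconds: int
    failure_threshold: int

    def failure_window(self) -> int:
        # Approximate span of consecutive failures before the probe is
        # considered failed; ignores initial delay and per-check timeout.
        return self.period_seconds * self.failure_threshold


# Values as described in the comment. The startup period (2s) and the
# readiness threshold (3) are ASSUMPTIONS inferred from the quoted totals.
current = {
    "startup": Probe(period_seconds=2, failure_threshold=5),
    "liveness": Probe(period_seconds=10, failure_threshold=3),
    "readiness": Probe(period_seconds=10, failure_threshold=3),
}
proposed = {
    "startup": Probe(period_seconds=2, failure_threshold=15),
    "liveness": Probe(period_seconds=10, failure_threshold=6),
    "readiness": Probe(period_seconds=5, failure_threshold=3),
}


def drain_window(probes: dict) -> int:
    # Time between being dropped from the service (readiness fails) and
    # being killed (liveness fails): the room left for in-flight requests.
    return (probes["liveness"].failure_window()
            - probes["readiness"].failure_window())


for label, probes in (("current", current), ("proposed", proposed)):
    for name, probe in probes.items():
        print(f"{label} {name}: fails after ~{probe.failure_window()}s")
    print(f"{label} drain window: ~{drain_window(probes)}s")
```

Under these assumptions the current settings leave no gap between dropping the
pod from the service and killing it, which is the main thing the proposed
values change.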
Just some brain-dump on my gut reaction, very much open to discussion and/or
correction. :)