HoustonPutman commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1402784282

   Thanks for both of your thoughts!
   
   As for the question of whether the liveness and readiness probes should hit the 
same endpoint: eventually, I don't think they should. I like `admin/info/system` as 
the handler for liveness (the same as it is now), since it basically just 
responds whenever Solr is running. Once Solr 8.0 is our minimum version (soon), 
`admin/info/health` would be great for the readiness probe, since we want to 
make sure that Solr can connect to ZooKeeper. Eventually adding a parameter 
that checks that most replicas on the host are healthy could be useful, but I think 
the ZK connection is a good place to start.
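   As a rough sketch, that split could look like the following in the pod spec (the paths and port are illustrative, not the operator's exact defaults):
   
   ```yaml
   # Hypothetical probe split: liveness only checks that Solr responds,
   # while readiness (Solr 8.0+) also verifies the ZooKeeper connection.
   livenessProbe:
     httpGet:
       path: /solr/admin/info/system
       port: 8983
   readinessProbe:
     httpGet:
       path: /solr/admin/info/health
       port: 8983
   ```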
   
   > Can this be overloaded to the point where it's still valid/running, but 
needs to process requests in flight before it can handle more? If so, we need 
to make sure the liveness probe won't fail in that scenario, but the readiness 
check would, so that it stops receiving new traffic, but is permitted to finish 
the in-flight requests.
   
   I think this is hard to do, since a lot of the in-flight request handling could 
be updates and queries for specific collections, which we can't easily 
introspect... But I definitely agree it would be great to get there in the end.
   
   >  I would suggest (opinions, definitely up for debate)
   
   (The actual bulk of the changes happening in this PR)
   
   **Yeah I've upped the number of checks in the startup probe to 10, giving 
the pod 1 minute to become healthy. I think that should be enough for the Solr 
server to start.**
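   For illustration, a startup probe with those numbers might look like this (the handler path is as discussed above; `periodSeconds` is an assumption chosen so that 10 checks span roughly a minute):
   
   ```yaml
   startupProbe:
     httpGet:
       path: /solr/admin/info/system   # illustrative handler
       port: 8983
     periodSeconds: 6                  # assumption: 10 checks over ~1 minute
     failureThreshold: 10
   ```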
   
   **For the others I agree. I have the liveness probe set to 3 checks at 
20-second intervals, giving us 40s-1m of "downtime" before the pod is taken 
down. Readiness is set to 2 checks at 10-second intervals, so if ZK isn't 
available for 10-20 seconds, requests won't be routed to that node. But if it's 
just a blip, one good readiness check puts it back in the list.** 
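   To make the arithmetic concrete, those settings would translate to something like this (a sketch, not the PR's literal diff; endpoints as discussed above):
   
   ```yaml
   livenessProbe:
     httpGet:
       path: /solr/admin/info/system
       port: 8983
     periodSeconds: 20
     failureThreshold: 3    # 40s-1m of failures before the pod is restarted
   readinessProbe:
     httpGet:
       path: /solr/admin/info/health
       port: 8983
     periodSeconds: 10
     failureThreshold: 2    # 10-20s of failures before traffic stops
     successThreshold: 1    # one good check puts the pod back in rotation
   ```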
   
   > To be candid, I'm not sure if the operator configures communication 
between nodes via services or directly with pod names. If configured with 
services then readiness probes could impact communication between nodes in the 
SolrCloud.
   
   So for node-level endpoints (the headless service and the individual per-pod 
services used for ingresses), the readiness check is not used for routing, since 
we use the `publishNotReadyAddresses: true` option for these services. The only 
service that doesn't use this option is the solrcloud-common service, which is 
what the readiness probe would be impacting. Also, the Solr Operator's rolling-restart 
logic only uses the readiness probe when calculating the maxPodsDown option, so 
a readiness probe that is more likely to return errors will slow down 
rolling restarts, but probably not to a large degree.
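   For reference, `publishNotReadyAddresses` is a standard Kubernetes Service field; a minimal headless-service sketch (the name and selector here are hypothetical):
   
   ```yaml
   apiVersion: v1
   kind: Service
   metadata:
     name: example-solrcloud-headless   # hypothetical name
   spec:
     clusterIP: None
     publishNotReadyAddresses: true     # keep pod addresses resolvable even when not Ready
     selector:
       solr-cloud: example              # hypothetical selector
     ports:
       - name: solr-client
         port: 8983
   ```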
   
   > One more consideration we commonly see in the wild is negative feedback 
loops for liveness probes.
   
   Yeah, this is definitely not something we want to take lightly. We only want 
to restart Solr nodes when absolutely necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

