endzyme commented on PR #511:
URL: https://github.com/apache/solr-operator/pull/511#issuecomment-1380505771

   Thanks for reaching out with these considerations. As Josh mentioned, we're not deviating from the current defaults in 0.6.0.
   
   I wanted to add on to what Josh mentioned above.
   
   This application is a little different from other apps because all the nodes are clustered and route traffic among themselves based on which data they hold for which collections. This matters because it changes the contextual reason behind readiness probes. Readiness probes mostly affect the "status" of the endpoints that associate the pod with any services it's part of. Since traffic can still hit a node even when it's pulled from the service, the readiness probe doesn't really have much effect on incoming requests. I'd have to think a little more about the intended value of readiness given how Solr works. To be candid, I'm not sure whether the operator configures communication between nodes via services or directly with pod names. If it's via services, then readiness probes could impact communication between nodes in the SolrCloud.
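
   To make the readiness point concrete, here's a minimal sketch of a readiness probe in operator-style Go (not the operator's actual code; the health path, port, and timings are illustrative assumptions):

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// solrReadinessProbe sketches a readiness probe. Failing it only removes the
// pod's endpoint from the service; nodes that address the pod directly (e.g.
// by pod DNS name) can still reach it, which is the caveat above.
func solrReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		// ProbeHandler is the embedded field name in recent k8s.io/api
		// releases; older releases call it Handler.
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path:   "/solr/admin/info/system", // illustrative health path
				Port:   intstr.FromInt(8983),
				Scheme: corev1.URISchemeHTTP,
			},
		},
		InitialDelaySeconds: 15,
		PeriodSeconds:       5,
		FailureThreshold:    3, // endpoint removed after ~15s of consecutive failures
	}
}
```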
   
   As for liveness probes, the main value I see is when restarting the Java process would actually resolve a problem. Liveness should really only trigger when the node cannot perform the most basic tasks but the process still appears to be running. Things that come to mind are an inability to read from disk, causing many 500s while the process is still technically "running". Another could be "runaway threads" that bog down the process, where terminating it is the only way to recover service availability. That said, it should be a blend of a service-critical KPI with how long you'd be comfortable having this pod unavailable.
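
   And a liveness sketch biased toward "only restart when a restart would actually help"; again, the path, port, and numbers are placeholder assumptions, not recommendations:

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// solrLivenessProbe sketches a deliberately lenient liveness probe: the kubelet
// only restarts the container after a sustained failure window of roughly
// periodSeconds * failureThreshold (~60s here), so a node that is merely slow
// isn't restarted as if it were wedged.
func solrLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/solr/admin/info/system", // illustrative health path
				Port: intstr.FromInt(8983),
			},
		},
		InitialDelaySeconds: 30,
		PeriodSeconds:       10,
		TimeoutSeconds:      5,
		FailureThreshold:    6, // ~60s of sustained failure before a restart
	}
}
```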
   
   All of the above are just considerations about the original intent of these tools and should definitely be weighed against how SolrCloud is intended to operate.
   
   One more consideration we commonly see in the wild is negative feedback loops with liveness probes. This one is pretty tricky, but the simplest example is that an application experiencing too much load can trip its liveness probe, which then restarts the app. While the app is restarting, the remaining pods take on more load, which can cascade and trip their liveness probes as well. These loops are usually the result of aggressive settings on liveness probes. In Java, the largest contributing factor is generally resource starvation, like under-allotting CPU or memory, which can lead to GC issues.
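
   A rough way to quantify "aggressive" here: the sustained failure window before the kubelet restarts a container is roughly periodSeconds * failureThreshold. A tiny sketch with arbitrary example numbers:

```go
package main

import (
	"fmt"
	"time"
)

// timeToRestart estimates how long a container must keep failing its liveness
// probe before the kubelet restarts it: roughly periodSeconds * failureThreshold
// (plus up to timeoutSeconds per failed attempt, ignored here).
func timeToRestart(periodSeconds, failureThreshold int) time.Duration {
	return time.Duration(periodSeconds*failureThreshold) * time.Second
}

func main() {
	fmt.Println("aggressive   (period=5s,  threshold=1):", timeToRestart(5, 1))  // 5s
	fmt.Println("conservative (period=10s, threshold=6):", timeToRestart(10, 6)) // 1m0s
}
```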
   
   Anyway, not sure if the last paragraph is very actionable, but it's something to keep in mind when initially tuning for aggressive restarts versus a more acceptable time to wait before restarting a service.

