gboor opened a new issue, #819:
URL: https://github.com/apache/solr-operator/issues/819

   I am deploying SolrCloud to a test cluster using the Helm chart for the operator and then the Helm chart for SolrCloud itself (which uses the operator).
   
   I am running SolrCloud version 9.10, which is compatible with the chart according to the version matrix [here](https://apache.github.io/solr-operator/docs/upgrade-notes.html) (assuming 9.10 falls under 9.4+). This might mean the issue is specific to running with a version override (perhaps 9.10 behaves differently from the default 8.11 in the chart values), but I have no easy way of verifying that.
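   
   For context, the version override is just an image tag in my chart values, roughly like this (a sketch, assuming the chart exposes `image.tag`; the tag string here is illustrative rather than copied from my values file):
   
   ```
   # Sketch of the version override, assuming the chart exposes image.tag;
   # the tag string is illustrative, not copied from my deployment.
   image:
     tag: "9.10"
   ```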
   
   Pretty much all the settings are default, except storage, which is persistent for both ZK and SolrCloud. I also run only 1 instance of ZK and 1 instance of SolrCloud on the test cluster, but 3 of each in production, and I see the same issue in both - it's just more common on test because of the single-node setup.
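   
   Roughly, the only non-default chart values look like this (a best-effort sketch; the exact value names, especially around ZK persistence, may not match the chart exactly):
   
   ```
   # Approximate test-cluster values; value names are a best-effort sketch
   # rather than copied verbatim from my values file (ZK persistence omitted).
   replicas: 1            # 3 in production
   dataStorage:
     type: persistent
   zk:
     provided:
       replicas: 1        # 3 in production
   ```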
   
   Here is what happens:
   1. ZK starts, SolrCloud starts, everything is good.
   2. At some point, for whatever reason, ZK goes down temporarily, but auto-recovers a few minutes later.
   3. SolrCloud goes into a sort of crashed state, but is still accessible. When I open the admin panel I see `KeeperErrorCode = Session expired for /aliases.json`.
   4. It never recovers. I have to manually delete the pod to make it restart.
   5. After the restart, it's fine again.
   
   Note that this does not always happen. Sometimes SolrCloud reconnects to ZK just fine, sometimes it doesn't, and I cannot figure out why.
   
   From some preliminary digging, I think I understand why it's not restarting in this case, but I wanted to verify that my assumptions are correct:
   
   1. The default readinessProbe hits `/solr/admin/info/health`, which returns a 503 in this state, so the pod is marked not ready but is never restarted.
   2. The default livenessProbe hits `/solr/admin/info/system`, which still responds fine, so it never triggers a restart (the effective probe shapes are sketched below).
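   
   For clarity, this is roughly the shape of the two probes as they end up on the Solr pods (paths and port as described above; timing fields omitted, and the operator's actual defaults may include more settings):
   
   ```
   # Approximate shape of the default probes on the generated Solr pods.
   # Paths and port are the ones mentioned above; other fields are omitted
   # and the operator's real defaults may differ.
   readinessProbe:
     httpGet:
       path: /solr/admin/info/health
       port: 8983
   livenessProbe:
     httpGet:
       path: /solr/admin/info/system
       port: 8983
   ```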
   
   I think the easiest fix would be to change the livenessProbe to also hit the `/health` endpoint, so that it forces a pod restart when that endpoint starts failing.
   I have overridden this in my own deployment and it works as expected, but I wanted to raise it in case someone else runs into it, or in case I am missing something and there is a better fix. Here are my overrides:
   
   ```
   podOptions:
     livenessProbe:
       httpGet:
         path: /solr/admin/info/health
         port: 8983
   ```
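   
   For completeness, my understanding is that this chart value ends up on the SolrCloud resource under `customSolrKubeOptions.podOptions`, roughly like this (a sketch of the mapping as I understand it, please correct me if it is wrong):
   
   ```
   # Roughly how I believe the override surfaces on the SolrCloud CRD;
   # the customSolrKubeOptions.podOptions path is my understanding of the mapping.
   spec:
     customSolrKubeOptions:
       podOptions:
         livenessProbe:
           httpGet:
             path: /solr/admin/info/health
             port: 8983
   ```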
   

