Hi,

I'm trying to prevent traffic from being sent to a Solr node that is about to 
shut down, to avoid interruption of service as seen from various clients.
The first part of the puzzle is signaling to any (external) load balancer to 
stop sending requests to the node.
The other part is having SolrJ understand that the node is being stopped, so 
that it stops routing internal requests to cores on that node.

Does anyone have a good command of the Shutdown logic in Solr?
My understanding is a bit sparse, but here's what I can see in the code: 

1. bin/solr stop sends a STOP command to Jetty's STOP_PORT with the 
(not-so-secret) stop key (a minimal sketch of that exchange follows this list)
2. Jetty starts the shutdown process, destroying all servlets and filters, 
including Solr's dispatchFilter
3. Solr is notified about the shutdown through a callback in CoreContainerProvider
4. CoreContainerProvider#close() is called, which calls CoreContainer#shutdown()
5. CoreContainer shuts down every core on the node and then calls 
ZkController#preClose()
6. ZkController#preClose() removes the ephemeral live_nodes/myNode znode and 
then publishes a down state in state.json
7. Solr waits for its executors to shut down and lets Jetty exit
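
For reference, the STOP command in step 1 is just a small TCP exchange against 
Jetty's ShutdownMonitor. Here is a minimal Java sketch of roughly what bin/solr 
stop does; the port and key below are the bin/solr defaults (STOP_PORT = 
SOLR_PORT - 1000, STOP_KEY "solrrocks") and would need adjusting for a real 
install:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class JettyStopClient {
  public static void main(String[] args) throws Exception {
    int stopPort = 7983;          // bin/solr default: SOLR_PORT - 1000
    String stopKey = "solrrocks"; // the not-so-secret default STOP_KEY
    try (Socket socket = new Socket("127.0.0.1", stopPort)) {
      OutputStream out = socket.getOutputStream();
      // ShutdownMonitor reads the key on one line, the command on the next
      out.write((stopKey + "\r\nstop\r\n").getBytes(StandardCharsets.UTF_8));
      out.flush();
      // the monitor closes the connection once shutdown has started
      socket.getInputStream().read();
    }
  }
}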

I could have got it wrong though.

I was hoping that a Solr node would first publish itself as "not ready" in ZK 
before rejecting requests, but it seems this is all reversed, since the 
shutdown is initiated by Jetty?
So could we instead register our own shutdown port in Solr, and let our 
bin/solr script trigger that one? There we could orchestrate the shutdown as we 
want:

1. Remove the node's live_nodes znode in ZK (see the sketch after this list)
2. Publish itself as not ready on the api/node/health handler (or a new 
api/node/ready?)
3. Sleep for a few seconds (or longer, with an optional &shutdownDelay argument 
to our shutdown endpoint)
4. Trigger server.stop() to take down Jetty and kill the servlet
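
To illustrate step 1: since any ZK client may delete another session's 
ephemeral node, the drain can even be approximated from the outside today. A 
rough Java sketch, assuming a chroot-less ZK at localhost:2181 and the usual 
host:port_solr node-name format (both assumptions, adjust for your setup):

import org.apache.zookeeper.ZooKeeper;

public class DrainNode {
  public static void main(String[] args) throws Exception {
    String zkHost = "localhost:2181";            // assumption: no chroot
    String nodeName = "192.168.1.10:8983_solr";  // assumption: node name format
    ZooKeeper zk = new ZooKeeper(zkHost, 15000, event -> {});
    // Removing the ephemeral live node makes CloudSolrClient and other nodes
    // treat this node as down, so no new internal requests are routed to it
    zk.delete("/live_nodes/" + nodeName, -1);
    Thread.sleep(5000); // drain window before actually stopping Jetty
    zk.close();
  }
}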

I filed https://issues.apache.org/jira/browse/SOLR-16722 to discuss a technical 
solution.
The primary goal is to drain traffic right before shutting a node down, but it 
could also be designed as a generic Readiness Probe 
<https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes>
modeled after Kubernetes?
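
For the load-balancer side, the check could be as simple as polling the health 
handler and treating anything but a 200 as "drain". A sketch using Java's 
built-in HttpClient, assuming the handler is reachable at /api/node/health 
(exit code 0 = ready):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadinessCheck {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8983/api/node/health"))
        .build();
    // 200 = ready; anything else (or a refused connection) = stop sending traffic
    HttpResponse<Void> response =
        client.send(request, HttpResponse.BodyHandlers.discarding());
    System.exit(response.statusCode() == 200 ? 0 : 1);
  }
}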
I'm also aware that any Solr client should be prepared to hit a dead node due 
to network/power events, and retry. But it won't hurt to be graceful whenever 
we can.

Happy to hear your thoughts. Is this a made-up problem?

Jan
