[
https://issues.apache.org/jira/browse/SOLR-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706376#comment-17706376
]
Jan Høydahl commented on SOLR-16722:
------------------------------------
I did some research into the current shutdown logic, and while I have not
untangled the exact chain of events, I see that CoreContainer#shutdown() first
shuts down all cores and then calls zkController#preClose, which publishes
the node as down in ZK.
[https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1197-L1223]
Shouldn't the live_nodes znode be removed first, followed by a pause, and only
then a shutdown of all cores? Now, it could be that Jetty rejects traffic to
the servlet before CC#shutdown is called, in which case that wouldn't work.
Since node shutdown is initiated from Jetty (signal to STOP_PORT), we'd need
to un-publish the node in ZK in some kind of pre-shutdown hook, but I have not
got to that.
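The ordering proposed above (un-publish, pause, then close cores) can be sketched as follows. This is only an illustration, not Solr code: {{LiveNodeRegistry}}, {{CoreCloser}} and the {{drainMillis}} parameter are hypothetical stand-ins for whatever in ZkController and CoreContainer would actually do the work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DrainThenShutdownSketch {

  /** Hypothetical stand-in for the code that removes this node's live_nodes entry. */
  interface LiveNodeRegistry {
    void unpublish(String nodeName);
  }

  /** Hypothetical stand-in for the code that shuts down all cores. */
  interface CoreCloser {
    void closeAllCores();
  }

  /**
   * Runs the three steps in the proposed order:
   * 1. stop advertising the node, 2. wait so clients can notice, 3. close cores.
   */
  static void shutdown(String nodeName, LiveNodeRegistry registry, CoreCloser cores,
      long drainMillis) throws InterruptedException {
    registry.unpublish(nodeName);
    // Assumed grace period; a suitable value would depend on how quickly
    // clients pick up the live_nodes change.
    TimeUnit.MILLISECONDS.sleep(drainMillis);
    cores.closeAllCores();
  }

  public static void main(String[] args) throws InterruptedException {
    List<String> events = new ArrayList<>();
    shutdown(
        "node1:8983_solr",
        node -> events.add("unpublish " + node),
        () -> events.add("close cores"),
        50);
    System.out.println(events);
  }
}
```

Note that this only helps if the servlet still accepts requests during the pause, which is exactly the open question about Jetty raised above.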
In {{close()}} in this code
[https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/servlet/CoreContainerProvider.java#L159-L196]
there is an interesting comment in this regard:
{quote}
// Mark Miller suggested that we should be publishing that we are down before anything else
// which makes good sense, but the following causes test failures, so that improvement can be
// the subject of another PR/issue. Also, jetty might already be refusing requests by this point
// so that's a potential issue too. Digging slightly I see that there's a whole mess of code
// looking up collections and calculating state changes associated with this call, which smells
// a lot like we're duplicating node state in collection stuff, but it will take a lot of code
// reading to figure out if that's really what it is, why we did it and if there's room for
// improvement.
// if (cc != null) {
//   ZkController zkController = cc.getZkController();
//   if (zkController != null) {
//     zkController.publishNodeAsDown(zkController.getNodeName());
//   }
// }
{quote}
But I think that by the time CCProvider receives {{contextDestroyed()}},
the servlet is already gone; see the [comment in
ServletContextListener|https://tomcat.apache.org/tomcat-8.0-doc/servletapi/javax/servlet/ServletContextListener.html].
> API to flag a solr node NOT READY for requests
> ----------------------------------------------
>
> Key: SOLR-16722
> URL: https://issues.apache.org/jira/browse/SOLR-16722
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Jan Høydahl
> Priority: Major
>
> Spinoff from solr operator PR
> [https://github.com/apache/solr-operator/issues/529]
> When solr-operator performs a rolling restart or rolling upgrade, it will
> stop one node at a time, but SolrJ (both external and internal) will continue
> sending traffic to the node until requests start failing, since at the time
> SolrJ picks up the "live_nodes" change, it is too late.
> While the operator PR mentioned above will prevent external requests through
> the k8s service to the draining node, it will not prevent internal traffic.
> This issue thus aims to introduce some API or mechanism to flag a Solr node
> as NOT READY for traffic.
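
As a rough illustration of the kind of mechanism the issue description asks for, the sketch below keeps a set of nodes flagged NOT READY and filters them out before routing. All names here ({{NotReadySketch}}, {{markNotReady}}, {{routable}}) are hypothetical and not part of Solr or SolrJ; in a real cluster the flag would have to live in shared state that both the servers and SolrJ clients can see.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class NotReadySketch {

  // Hypothetical in-memory set of nodes flagged NOT READY. A real
  // implementation would need cluster-wide shared state instead.
  private final Set<String> notReady = ConcurrentHashMap.newKeySet();

  /** Flags a node so that it is skipped for new requests. */
  void markNotReady(String node) {
    notReady.add(node);
  }

  /** Clears the flag, for example after a restart has completed. */
  void markReady(String node) {
    notReady.remove(node);
  }

  /** Returns the live nodes that are not flagged, preserving their order. */
  List<String> routable(List<String> liveNodes) {
    return liveNodes.stream()
        .filter(node -> !notReady.contains(node))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    NotReadySketch readiness = new NotReadySketch();
    List<String> live = List.of("node1", "node2", "node3");

    readiness.markNotReady("node2"); // node2 is about to be restarted
    System.out.println(readiness.routable(live)); // prints [node1, node3]
  }
}
```

The point of the flag is that it is set before the node leaves live_nodes, so both internal and external clients stop picking the node while it can still finish in-flight requests.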