[ 
https://issues.apache.org/jira/browse/SOLR-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706376#comment-17706376
 ] 

Jan Høydahl commented on SOLR-16722:
------------------------------------

I did some research into the current shutdown logic. While I have not untangled 
the exact chain of events, I see that CoreContainer#shutdown() first 
shuts down all cores and then calls zkController#preClose, which publishes 
the node as down in ZK. 
[https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1197-L1223]

Shouldn't the node's live_nodes znode be removed first, followed by a pause, 
and only then should all cores be shut down? Now, it could be that Jetty rejects 
traffic to the servlet before CC#shutdown is called, so that wouldn't work. 
Since node shutdown is initiated from Jetty (a signal to STOP_PORT), we'd need 
to un-publish the node in ZK in some kind of pre-shutdown hook, but I have not 
got to that yet.
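
To make the proposed ordering concrete, here is a minimal, self-contained sketch. It does not use Solr's real classes: {{NodeRegistry}}, {{CoreCloser}}, {{unpublishNode}}, {{shutdownAllCores}} and the drain delay are hypothetical stand-ins for whatever ZkController and CoreContainer would actually expose, and it does not address the open question of whether Jetty is already refusing requests at that point.

```java
import java.util.ArrayList;
import java.util.List;

public class GracefulShutdownSketch {

    /** Hypothetical stand-in for the ZK side: removes the node's live_nodes entry. */
    interface NodeRegistry {
        void unpublishNode(String nodeName);
    }

    /** Hypothetical stand-in for the CoreContainer side: closes all cores. */
    interface CoreCloser {
        void shutdownAllCores();
    }

    /**
     * Proposed ordering: un-publish the node first, pause so that clients can
     * notice the live_nodes change, and only then shut down the cores.
     */
    static void shutdown(String nodeName, NodeRegistry registry, CoreCloser cores, long drainMillis) {
        // 1. Tell the cluster this node should no longer receive traffic.
        registry.unpublishNode(nodeName);

        // 2. Give clients time to pick up the change (the delay is an assumed, tunable value).
        try {
            Thread.sleep(drainMillis);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and continue shutting down.
            Thread.currentThread().interrupt();
        }

        // 3. Only now close the cores.
        cores.shutdownAllCores();
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        shutdown(
            "node1:8983_solr",
            name -> events.add("unpublish " + name),
            () -> events.add("close cores"),
            100L);
        System.out.println(events);
    }
}
```

The lambdas in {{main}} only record the order of the calls; in a real change the first step would have to run from a hook that fires before Jetty stops accepting requests.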

In this code 
[https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/servlet/CoreContainerProvider.java#L159-L196]
 in {{close()}} there is an interesting comment in this regard:
{quote}    // Mark Miller suggested that we should be publishing that we are down before anything else
    // which makes good sense, but the following causes test failures, so that improvement can be
    // the subject of another PR/issue. Also, jetty might already be refusing requests by this point
    // so that's a potential issue too. Digging slightly I see that there's a whole mess of code
    // looking up collections and calculating state changes associated with this call, which smells
    // a lot like we're duplicating node state in collection stuff, but it will take a lot of code
    // reading to figure out if that's really what it is, why we did it and if there's room for
    // improvement.
    //    if (cc != null) {
    //      ZkController zkController = cc.getZkController();
    //      if (zkController != null) {
    //        zkController.publishNodeAsDown(zkController.getNodeName());
    //      }
    //    }
{quote}
But I think that by the time CCProvider receives {{contextDestroyed()}}, 
the servlet is already gone; see the [comment in 
ServletContextListener|https://tomcat.apache.org/tomcat-8.0-doc/servletapi/javax/servlet/ServletContextListener.html].

> API to flag a solr node NOT READY for requests
> ----------------------------------------------
>
>                 Key: SOLR-16722
>                 URL: https://issues.apache.org/jira/browse/SOLR-16722
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Jan Høydahl
>            Priority: Major
>
> Spinoff from solr operator PR 
> [https://github.com/apache/solr-operator/issues/529]
> When solr-operator performs a rolling restart or rolling upgrade, it will 
> stop one node at a time, but SolrJ (both external and internal) will continue 
> sending traffic to the node until requests start failing, since at the time 
> SolrJ picks up the "live_nodes" change, it is too late.
> While the operator PR mentioned above will prevent external requests through 
> the k8s service to the draining node, it will not prevent internal traffic.
> This issue thus aims to introduce some API or mechanism to flag a Solr node 
> as NOT READY for traffic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

