[ 
https://issues.apache.org/jira/browse/SOLR-17765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-17765:
--------------------------------------
    Attachment: SOLR-17765.patch
        Status: Open  (was: Open)

A trivial attempt to uncomment this code and run all tests w/multiple seeds 
only surfaced a single test failure that seemed to be related: 
{{ZkFailoverTest}} would error for some seeds (related to PRS randomization 
IIUC?) during {{MiniSolrCloudCluster}} shutdown at the end of the test.  The 
exception was thrown from the (now uncommented) call to 
{{zkController.publishNodeAsDown(...)}}, because the {{ZkTestServer}} had been 
shut down, causing a {{SolrException}} from the low level {{ZkStateReader}} code.

 

In my attached patch, I've made two small additions to the originally 
commented-out code:
 * a quick check of {{getZkClient().isConnected()}} in 
{{CoreContainerProvider}} _before_ calling {{publishNodeAsDown()}}
 * _inside_ of {{publishNodeAsDown()}} I added {{SolrException}} to the existing 
"warn on ZK exceptions and treat as No-Op" handling, since the javadocs for the 
method explicitly say {{Best effort to set DOWN state...}}, suggesting that 
underlying SolrExceptions should not be propagated.
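
For readers without the patch handy, the shape of those two guards can be illustrated with a standalone sketch. This is not the patch itself: {{ZkClientStub}}, {{NodeStateException}}, and {{ShutdownSketch}} are hypothetical stand-ins for the ZK client, {{SolrException}}, and the {{CoreContainerProvider}} / {{ZkController}} pair, and the real method bodies are assumed to be more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the ZooKeeper client wrapper; not a Solr type. */
interface ZkClientStub {
  boolean isConnected();

  /** Writes the DOWN state; may throw NodeStateException if ZK is unreachable. */
  void writeDownState(String nodeName);
}

/** Hypothetical stand-in for the unchecked SolrException. */
class NodeStateException extends RuntimeException {
  NodeStateException(String msg) {
    super(msg);
  }
}

/** Hypothetical sketch combining the close() guard and the best-effort publish. */
class ShutdownSketch {
  /** Collected warnings; a real implementation would use a logger instead. */
  final List<String> warnings = new ArrayList<>();

  private final ZkClientStub zk;

  ShutdownSketch(ZkClientStub zk) {
    this.zk = zk;
  }

  /** Best effort: warn and treat failures as a no-op instead of propagating them. */
  void publishNodeAsDown(String nodeName) {
    try {
      zk.writeDownState(nodeName);
    } catch (NodeStateException e) {
      warnings.add("Could not publish " + nodeName + " as DOWN: " + e.getMessage());
    }
  }

  /** Shutdown path: skip the publish entirely when ZK is already disconnected. */
  void close(String nodeName) {
    if (zk != null && zk.isConnected()) {
      publishNodeAsDown(nodeName);
    }
  }
}
```

The connectivity check only avoids the call in the common case; the catch inside {{publishNodeAsDown()}} still matters because the connection can drop between the check and the write.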

 

/ping [~markrmil...@gmail.com] & [~gus] given their previous work on this code

> Nodes should publish themselves as DOWN ASAP during shutdown
> ------------------------------------------------------------
>
>                 Key: SOLR-17765
>                 URL: https://issues.apache.org/jira/browse/SOLR-17765
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-17765.patch
>
>
> While working on SOLR-17744 I noticed this comment in 
> {{CoreContainerProvider.close()}} (that seems to date back to SOLR-15590) ...
> {noformat}
> // Mark Miller suggested that we should be publishing that we are down before anything else
> // which makes good sense, but the following causes test failures, so that improvement can be
> // the subject of another PR/issue. Also, jetty might already be refusing requests by this point
> // so that's a potential issue too. Digging slightly I see that there's a whole mess of code
> // looking up collections and calculating state changes associated with this call, which smells
> // a lot like we're duplicating node state in collection stuff, but it will take a lot of code
> // reading to figure out if that's really what it is, why we did it and if there's room for
> // improvement.
> //    if (cc != null) {
> //      ZkController zkController = cc.getZkController();
> //      if (zkController != null) {
> //        zkController.publishNodeAsDown(zkController.getNodeName());
> //      }
> //    }
> {noformat}
> ...I'm creating this Jira because I see no other existing Jira addressing 
> this idea.


