[ https://issues.apache.org/jira/browse/SOLR-17765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris M. Hostetter updated SOLR-17765:
--------------------------------------
    Attachment: SOLR-17765.patch
        Status: Open  (was: Open)

A trivial attempt to uncomment this code and run all tests w/multiple seeds only surfaced a single test failure that seemed to be related: {{ZkFailoverTest}} would error for some seeds (related to PRS randomization IIUC?) during {{MiniSolrCloudCluster}} shutdown at the end of the test.

The exception was thrown from the (now uncommented) call to {{zkController.publishNodeAsDown(...)}}, because the {{ZkTestServer}} had already been shut down, causing a {{SolrException}} from the low level {{ZkStateReader}} code.

In my attached patch, I've made two small additions to the originally commented code:
* a quick check of {{getZkClient().isConnected()}} in {{CoreContainerProvider}} _before_ calling {{publishNodeAsDown()}}
* _inside_ of {{publishNodeAsDown()}} I added {{SolrException}} to the existing "warn on ZK exceptions and treat as No-Op" handling, since the javadocs for the method explicitly say {{Best effort to set DOWN state...}}, suggesting that underlying SolrExceptions should not be propagated.

/ping [~markrmil...@gmail.com] & [~gus] given their previous work on this code

> Nodes should publish themselves as DOWN ASAP during shutdown
> ------------------------------------------------------------
>
>                 Key: SOLR-17765
>                 URL: https://issues.apache.org/jira/browse/SOLR-17765
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-17765.patch
>
>
> While working on SOLR-17744 i noticed this comment in {{CoreContainerProvider.close()}} (that seems to date back to SOLR-15590) ...
> {noformat}
> // Mark Miller suggested that we should be publishing that we are down before anything else
> // which makes good sense, but the following causes test failures, so that improvement can be
> // the subject of another PR/issue. Also, jetty might already be refusing requests by this point
> // so that's a potential issue too. Digging slightly I see that there's a whole mess of code
> // looking up collections and calculating state changes associated with this call, which smells
> // a lot like we're duplicating node state in collection stuff, but it will take a lot of code
> // reading to figure out if that's really what it is, why we did it and if there's room for
> // improvement.
> // if (cc != null) {
> //   ZkController zkController = cc.getZkController();
> //   if (zkController != null) {
> //     zkController.publishNodeAsDown(zkController.getNodeName());
> //   }
> // }
> {noformat}
> ...I'm creating this Jira because I see no other existing Jira addressing this idea.
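
For readers without the patch at hand, the two additions described in the comment above can be sketched roughly as follows. This is not the attached patch and does not use the real Solr classes: {{ZkClient}}, {{StateException}}, {{Controller}} and the {{stateReadFails}} flag are hypothetical stand-ins invented for illustration, and the real {{publishNodeAsDown()}} does considerably more work. The sketch only shows the shape of the change: check connectivity before publishing, and treat a failure inside the publish call as a warning rather than propagating it.

```java
import java.util.ArrayList;
import java.util.List;

class ShutdownPublishSketch {

  /** Hypothetical stand-in for the ZooKeeper client; only the connectivity check is modeled. */
  interface ZkClient {
    boolean isConnected();
  }

  /** Hypothetical stand-in for the exception raised by low-level state-reading code. */
  static class StateException extends RuntimeException {
    StateException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical stand-in for the controller that publishes node state. */
  static class Controller {
    final ZkClient zkClient;
    final boolean stateReadFails; // simulates ZK going away mid-call (assumption for the demo)
    final List<String> published = new ArrayList<>();
    final List<String> warnings = new ArrayList<>();

    Controller(ZkClient zkClient, boolean stateReadFails) {
      this.zkClient = zkClient;
      this.stateReadFails = stateReadFails;
    }

    /** Best effort: a failure is recorded as a warning and never propagated to the caller. */
    void publishNodeAsDown(String nodeName) {
      try {
        if (stateReadFails) {
          throw new StateException("ZK unavailable");
        }
        published.add(nodeName);
      } catch (StateException e) {
        warnings.add("Could not publish " + nodeName + " as DOWN: " + e.getMessage());
      }
    }
  }

  /** Shutdown path: publish DOWN before anything else, but only while still connected. */
  static void close(Controller controller, String nodeName) {
    if (controller != null && controller.zkClient.isConnected()) {
      controller.publishNodeAsDown(nodeName);
    }
    // ... the rest of the shutdown sequence would follow here
  }

  public static void main(String[] args) {
    Controller healthy = new Controller(() -> true, false);
    close(healthy, "node1");
    System.out.println("published=" + healthy.published + " warnings=" + healthy.warnings);

    Controller zkGone = new Controller(() -> false, true);
    close(zkGone, "node2");
    System.out.println("published=" + zkGone.published + " warnings=" + zkGone.warnings);
  }
}
```

The connectivity check alone cannot close the race where ZooKeeper disappears between the check and the publish call, which is presumably why the second, in-method safeguard is also useful.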