[ https://issues.apache.org/jira/browse/SOLR-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337613#comment-17337613 ]

David Smiley commented on SOLR-15386:
-------------------------------------

It may be enough to simply double-check that the node is still down both before and 
+after+ NodeMutator.downNode returns its list of operations.
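
A minimal sketch of the shape of that guard, in plain Java with stand-ins (a 
{{Supplier<Set<String>>}} for a fresh read of {{/live_nodes}} and a function standing 
in for {{NodeMutator.downNode}}; these names are illustrative, not the real Solr API):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

class DownNodeGuard {
  // liveNodesReader stands in for a fresh read of /live_nodes;
  // computeDownNodeOps stands in for NodeMutator.downNode (builds the DOWNNODE state-change ops).
  static <OP> List<OP> downNodeIfStillDown(String nodeName,
                                           Supplier<Set<String>> liveNodesReader,
                                           Function<String, List<OP>> computeDownNodeOps) {
    // Check before doing any work: the node may already be back up.
    if (liveNodesReader.get().contains(nodeName)) {
      return Collections.emptyList();
    }
    List<OP> ops = computeDownNodeOps.apply(nodeName);
    // Check again once the ops are built: the node may have come back in the meantime.
    if (liveNodesReader.get().contains(nodeName)) {
      return Collections.emptyList();
    }
    return ops;
  }
}
{code}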

As I look in Overseer's ClusterStateUpdater and related places, it's a bit 
unclear what happens when a state change fails due to a bad version.  I see 
{{org.apache.solr.cloud.overseer.ZkStateWriter#invalidState}}, and I see that 
exceptions are propagated, but it's not clear whether there is a retry.  Maybe the 
message is re-processed?
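
For reference, the mechanism underneath a "bad version" failure is ZooKeeper's 
versioned setData: the write only succeeds if the znode has not changed since it was 
read.  A minimal sketch with the raw ZooKeeper client (the path handling and return 
convention here are illustrative, not what ZkStateWriter actually does):

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class ConditionalStateWrite {
  /**
   * Attempts a compare-and-set style update of a state znode.
   * Returns true if the write went through, false if another writer
   * updated the znode after we read it (bad version).
   */
  static boolean tryUpdate(ZooKeeper zk, String path, byte[] newState)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    zk.getData(path, false, stat);                     // read the current version
    try {
      zk.setData(path, newState, stat.getVersion());   // conditional (versioned) write
      return true;
    } catch (KeeperException.BadVersionException e) {
      // Someone else (e.g. the node re-registering) changed the znode in between;
      // the caller must re-read cluster state and decide whether to retry.
      return false;
    }
  }
}
{code}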

Assuming that when a node comes back up it always updates the corresponding state 
(even if it is already marked up), the conditional update resulting from DOWNNODE 
would fail, and the Overseer would have to re-examine the live nodes and ultimately 
bail (which is what we want).
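
That desired flow would look roughly like this (a sketch under the assumption that 
the node's re-registration makes the conditional write fail with a bad version; 
{{attemptConditionalWrite}} and {{nodeIsLive}} are hypothetical stand-ins, not 
existing Overseer methods):

{code:java}
import java.util.function.BooleanSupplier;

class DownNodeRetryPolicy {
  // attemptConditionalWrite: tries the DOWNNODE state update, returns false on a bad-version failure.
  // nodeIsLive: re-reads /live_nodes and reports whether the node has come back.
  static void processDownNode(BooleanSupplier nodeIsLive,
                              BooleanSupplier attemptConditionalWrite,
                              int maxRetries) {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      if (nodeIsLive.getAsBoolean()) {
        return;                  // node came back: bail, do not mark its replicas down
      }
      if (attemptConditionalWrite.getAsBoolean()) {
        return;                  // state update applied
      }
      // Bad version: cluster state changed underneath us; loop to re-examine live nodes.
    }
  }
}
{code}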

> Internal DOWNNODE request will mark replicas down even if their host node is 
> now live
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-15386
>                 URL: https://issues.apache.org/jira/browse/SOLR-15386
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 8.6
>            Reporter: Megan Carey
>            Priority: Major
>
> When a node is shutting down, it calls into:
>  # 
> [CoreContainer.shutdown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1026]
>  # 
> [ZkController.preClose()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L612]
>  # 
> [ZkController.publishNodeAsDown|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2753]
> This sends a request to Overseer to mark all of the replicas DOWN for the 
> soon-to-be down node.
> # 
> [Overseer.processMessage()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/Overseer.java#L459]
> # 
> [NodeMutator.downNode()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L48]
> The issue we encountered was as follows:
> # Solr node shuts down
> # DOWNNODE message is enqueued for Overseer
> # Solr node comes back up (running on K8s, so a new node is auto-started as 
> soon as the old node was detected as down)
> # DOWNNODE was dequeued for processing, and marked all replicas DOWN for the 
> node that is now live.
> The only place where these replicas would later be marked ACTIVE again is 
> after ShardLeaderElection, but we did not reach that case. An easy fix is to 
> add a check for node liveness prior to marking replicas down, but a lot of 
> tests fail with this change. Was this the intended functionality? 


