[ 
https://issues.apache.org/jira/browse/SOLR-15386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova updated SOLR-15386:
-----------------------------------
    Security:     (was: Public)

> Internal DOWNNODE request will mark replicas down even if their host node is 
> now live
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-15386
>                 URL: https://issues.apache.org/jira/browse/SOLR-15386
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 8.6
>            Reporter: Megan Carey
>            Priority: Major
>
> When a node is shutting down, it calls into:
>  # [CoreContainer.shutdown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1026]
>  # [ZkController.preClose()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L612]
>  # [ZkController.publishNodeAsDown()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L2753]
> This sends a request to Overseer to mark all of the replicas DOWN for the 
> soon-to-be-down node.
>  # [Overseer.processMessage()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/Overseer.java#L459]
>  # [NodeMutator.downNode()|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L48]
> The issue we encountered was as follows:
> # Solr node shuts down
> # DOWNNODE message is enqueued for Overseer
> # Solr node comes back up (running on K8s, so a new node is auto-started as 
> soon as the old node is detected as down)
> # DOWNNODE is dequeued for processing and marks all replicas DOWN for the 
> node that is now live.
> The only place where these replicas would later be marked ACTIVE again is 
> after ShardLeaderElection, but we did not reach that case. An easy fix is to 
> add a check for node liveness prior to marking replicas down, but a lot of 
> tests fail with this change. Was this the intended functionality? 
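
The race and the proposed liveness check can be illustrated with a minimal,
self-contained Java sketch. It does not use Solr's real classes: the class
DownNodeSketch, its fields, and both method names are hypothetical stand-ins
for the cluster state and for the DOWNNODE handling described above, and the
guard shown is only one possible reading of "a check for node liveness".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of cluster state; not Solr's actual API. */
class DownNodeSketch {
    enum State { ACTIVE, DOWN }

    // replica name -> name of the node hosting it
    final Map<String, String> replicaToNode = new HashMap<>();
    // replica name -> last published state
    final Map<String, State> replicaState = new HashMap<>();
    // nodes currently registered as live
    final Set<String> liveNodes = new HashSet<>();

    void addReplica(String replica, String node) {
        replicaToNode.put(replica, node);
        replicaState.put(replica, State.ACTIVE);
    }

    /** Behavior described in the report: mark every replica on the node DOWN. */
    int downNodeUnchecked(String node) {
        int changed = 0;
        for (Map.Entry<String, String> e : replicaToNode.entrySet()) {
            if (e.getValue().equals(node)
                    && replicaState.get(e.getKey()) != State.DOWN) {
                replicaState.put(e.getKey(), State.DOWN);
                changed++;
            }
        }
        return changed;
    }

    /** Assumed fix: ignore a stale DOWNNODE if the node is live again. */
    int downNodeIfNotLive(String node) {
        if (liveNodes.contains(node)) {
            return 0; // node came back before the message was processed
        }
        return downNodeUnchecked(node);
    }

    public static void main(String[] args) {
        DownNodeSketch cluster = new DownNodeSketch();
        cluster.addReplica("replica1", "nodeA");
        // nodeA shut down, enqueued DOWNNODE, then restarted before dequeue.
        cluster.liveNodes.add("nodeA");
        System.out.println("guarded:   " + cluster.downNodeIfNotLive("nodeA")
                + " replicas marked DOWN");
        System.out.println("unguarded: " + cluster.downNodeUnchecked("nodeA")
                + " replicas marked DOWN");
    }
}
```

In this model the guarded path leaves the replicas of a restarted node
untouched, while the unguarded path reproduces the reported behavior. Whether
a guard of this kind is compatible with the existing tests is exactly the open
question raised in the issue.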



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
