[ 
https://issues.apache.org/jira/browse/SOLR-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779897#comment-17779897
 ] 

Vincent Primault commented on SOLR-17049:
-----------------------------------------

Thread on [email protected]: 
[https://lists.apache.org/thread/3q5t2kxbpq7poc6nb06qgs1gld2f6ny0]

I can see two ways of fixing this:
 * Rely on the cluster state to determine which collections have a replica on the local node
 * Rely on CoresLocator, which would be consistent with how cores are discovered at startup
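
A minimal sketch of how the two options differ in where the list of replicas to wait for comes from. Everything in it is a simplified assumption for illustration: {{ReplicaEntry}}, {{HypotheticalLocator}} and {{LocalReplicaSketch}} are made-up types, not Solr classes, and the node and core names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified stand-in for one replica entry in cluster state. */
final class ReplicaEntry {
    final String collection;
    final String coreName;
    final String nodeName;

    ReplicaEntry(String collection, String coreName, String nodeName) {
        this.collection = collection;
        this.coreName = coreName;
        this.nodeName = nodeName;
    }
}

/** Hypothetical stand-in for a discovery mechanism that lists cores found on disk. */
interface HypotheticalLocator {
    List<String> discoverCoreNames();
}

final class LocalReplicaSketch {
    /** Option 1: filter cluster state down to the replicas hosted on this node. */
    static List<String> fromClusterState(List<ReplicaEntry> clusterState, String thisNode) {
        List<String> cores = new ArrayList<>();
        for (ReplicaEntry replica : clusterState) {
            if (replica.nodeName.equals(thisNode)) {
                cores.add(replica.coreName);
            }
        }
        return cores;
    }

    /** Option 2: ask the same discovery mechanism that startup uses later on. */
    static List<String> fromLocator(HypotheticalLocator locator) {
        return new ArrayList<>(locator.discoverCoreNames());
    }

    public static void main(String[] args) {
        List<ReplicaEntry> state = Arrays.asList(
            new ReplicaEntry("books", "books_shard1_replica_n1", "node1:8983_solr"),
            new ReplicaEntry("books", "books_shard1_replica_n2", "node2:8983_solr"),
            new ReplicaEntry("films", "films_shard1_replica_n1", "node1:8983_solr"));

        // Both calls yield a non-empty list before any core has been loaded.
        System.out.println(fromClusterState(state, "node1:8983_solr"));
        System.out.println(fromLocator(() -> Arrays.asList("books_shard1_replica_n1")));
    }
}
```

Either way, the list is available before cores are loaded, so the wait for down states would have something to wait on.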

> Marking replicas down at startup and waiting does not wait
> ----------------------------------------------------------
>
>                 Key: SOLR-17049
>                 URL: https://issues.apache.org/jira/browse/SOLR-17049
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.6
>            Reporter: Vincent Primault
>            Priority: Major
>
> We observed an unexpected behaviour where a node was taking traffic for a 
> replica that was not ready to take it. It seems to happen when the node is 
> marked as live and the replica is marked as active, while the corresponding 
> core is not loaded yet on the node.
>  
> I looked at the code and in theory it should not happen, since the following
> happens in {{ZkController#init}}: mark node as down, wait for replicas to
> be marked as down, and then register the node as live. However, after looking
> at the code of {{publishAndWaitForDownStates}}, I observed that we wait
> for down states for replicas associated with cores as returned by
> {{CoreContainer#getCoreDescriptors}}... which is empty at this point
> since {{ZkController#init}} is called before cores are discovered (which
> happens later in {{CoreContainer#load}}).
>  
> It therefore seems to me that we basically never wait for any replica to be
> marked as down. The startup sequence continues by marking the node as live,
> so the node _might_ take traffic for a short period of time for a replica
> that is not ready (e.g., if the node previously crashed and the replica
> stayed active).
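
The ordering problem described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: {{StartupOrderSketch}} and its methods are hypothetical names that mimic the described sequence, not Solr's actual code, and the "wait" is reduced to counting how many replicas it would cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the described startup ordering; all names are hypothetical. */
final class StartupOrderSketch {
    private final List<String> coresOnDisk;                               // what discovery will find
    private final List<String> loadedCoreDescriptors = new ArrayList<>(); // filled only by load()
    final List<String> events = new ArrayList<>();

    StartupOrderSketch(List<String> coresOnDisk) {
        this.coresOnDisk = coresOnDisk;
    }

    /** Mimics the described init sequence: publish down, wait, then go live. */
    void init() {
        events.add("publish down");
        // The wait is driven by descriptors that have not been discovered yet.
        int waitedFor = waitForDownStates(loadedCoreDescriptors);
        events.add("waited for " + waitedFor + " replicas");
        events.add("node live");
    }

    /** Core discovery happens here, after init() has already returned. */
    void load() {
        loadedCoreDescriptors.addAll(coresOnDisk);
        events.add("loaded " + loadedCoreDescriptors.size() + " cores");
    }

    /** Returns how many replicas the wait covers; an empty list means no wait at all. */
    private int waitForDownStates(List<String> cores) {
        return cores.size();
    }

    public static void main(String[] args) {
        StartupOrderSketch node = new StartupOrderSketch(
            Arrays.asList("books_shard1_replica_n1", "films_shard1_replica_n1"));
        node.init();
        node.load();
        for (String event : node.events) {
            System.out.println(event);
        }
    }
}
```

In this model the node goes live after waiting on zero replicas, even though two cores are discovered afterwards.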



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
