Hello, I have been revisiting an earlier investigation of ours into an unexpected behaviour where a node was taking traffic for a replica that was not ready to serve it. It seems to happen when the node is marked as live and the replica is marked as active, while the corresponding core has not yet been loaded on the node.
I looked at the code and in theory this should not happen, because ZkController#init does the following: it marks the node as down, waits for the replicas to be marked as down, and only then registers the node as live. However, looking at the code of publishAndWaitForDownStates, I observed that it waits for down states only for the replicas associated with the cores returned by CoreContainer#getCoreDescriptors, and that list is empty at this point, since ZkController#init is called before cores are discovered (which happens later, in CoreContainer#load).

It therefore seems to me that we never actually wait for any replica to be marked as down. The startup sequence continues by marking the node as live, so the node *might* take traffic for a short period of time for a replica that is not ready (e.g., if the node previously crashed and the replica stayed active).

As I am new to investigating this kind of issue in SolrCloud, I wanted to share my findings and get feedback on whether this analysis is correct (in which case I would be happy to contribute a bug fix) or whether I am missing something.

Thank you,
Vincent Primault
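
P.S. To make the ordering concrete, here is a minimal, self-contained model of the sequence as I understand it. It does not use any Solr classes: FakeCoreContainer and replicasWaitedOn are hypothetical stand-ins I made up for illustration, and the only assumption carried over from the real code is that the list of core descriptors stays empty until the load step has run.

```java
import java.util.ArrayList;
import java.util.List;

public class StartupOrderSketch {

    // Hypothetical stand-in for the core container (not Solr's class):
    // cores present on disk are only known once load() has run.
    static class FakeCoreContainer {
        private final List<String> onDisk;
        private final List<String> discovered = new ArrayList<>();

        FakeCoreContainer(List<String> onDisk) {
            this.onDisk = onDisk;
        }

        // Returns a copy of the cores discovered so far.
        List<String> getCoreDescriptors() {
            return new ArrayList<>(discovered);
        }

        // Core discovery: after this call the descriptors are visible.
        void load() {
            discovered.addAll(onDisk);
        }
    }

    // Hypothetical model of the "wait for down states" step: the set of
    // replicas it would wait on is whatever the container knows right now.
    static List<String> replicasWaitedOn(FakeCoreContainer container) {
        return container.getCoreDescriptors();
    }

    public static void main(String[] args) {
        FakeCoreContainer container = new FakeCoreContainer(List.of("core1", "core2"));

        // Order described above: the wait step runs before core discovery.
        List<String> beforeLoad = replicasWaitedOn(container);
        container.load();
        // Same step, evaluated after discovery, for comparison.
        List<String> afterLoad = replicasWaitedOn(container);

        System.out.println("Waited on before load: " + beforeLoad);
        System.out.println("Waited on after load:  " + afterLoad);
    }
}
```

In this model the wait step has nothing to wait on when it runs before discovery, which is the situation I believe we are in. If that is right, a fix would presumably need either to discover cores before this step or to derive the replicas to wait on from another source, but I have not verified which is feasible.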