Hi Vincent, I have seen that behavior: the node gets re-provisioned, the replica on that node comes back up live, and ZK starts routing traffic to it, but the response time from that replica is really high for a short period.
Worked around it by adding a few hundred warming queries, which keeps the replica in recovery until all the queries are replayed and hence delays the live state (rough sketches of both the workaround and the ordering are below the quoted mail). But yeah, not a good solution, as it always puts the replica into recovery for minutes, which may not be needed if, as you said, the issue is the replica being marked live before the core is loaded.

Thank you,
Rajani

On Thu, Oct 5, 2023, 3:26 AM Vincent Primault <vprima...@salesforce.com.invalid> wrote:

> Hello,
>
> I have been looking at a previous investigation we had about an
> unexpected behaviour where a node was taking traffic for a replica that
> was not ready to take it. It seems to happen when the node is marked as
> live and the replica is marked as active, while the corresponding core
> has not yet been loaded on the node.
>
> I looked at the code, and in theory this should not happen, since
> ZkController#init does the following: mark the node as down, wait for
> its replicas to be marked as down, and only then register the node as
> live. However, after looking at the code of publishAndWaitForDownStates,
> I observed that we wait for down states only for the replicas associated
> with the cores returned by CoreContainer#getCoreDescriptors... which is
> empty at this point, since ZkController#init is called before cores are
> discovered (that happens later, in CoreContainer#load).
>
> It therefore seems to me that we basically never wait for any replicas
> to be marked as down, and we continue the startup sequence by marking
> the node as live, so we *might* take traffic for a short period of time
> for a replica that is not ready (e.g., if the node previously crashed
> and the replica stayed active).
>
> As I am new to investigating this kind of issue in SolrCloud, I want to
> share my findings and get feedback on whether my analysis is correct (in
> which case I'd be happy to contribute a bug fix), or whether I am
> missing something.
>
> Thank you,
>
> Vincent Primault.
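For reference, the warming workaround looks roughly like the sketch below: a custom firstSearcher listener that replays queries before the new searcher is registered. The class name, fields and terms are made up, and in practice the same thing is done declaratively with solr.QuerySenderListener in solrconfig.xml; this is just to show the mechanism, not our exact setup.

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.core.SolrEventListener;
    import org.apache.solr.search.SolrIndexSearcher;

    // Hypothetical listener: replays warming queries on the firstSearcher
    // event, which delays the point at which the replica reports ready.
    public class WarmingQueriesListener implements SolrEventListener {

      public void init(NamedList args) {}

      @Override
      public void newSearcher(SolrIndexSearcher newSearcher,
                              SolrIndexSearcher currentSearcher) {
        // currentSearcher == null means this is the firstSearcher event:
        // the core is still loading and is not serving traffic yet.
        if (currentSearcher != null) {
          return;
        }
        // Made-up warming terms; the real workaround replays a few hundred
        // recorded production queries here.
        String[][] warmingTerms = {{"category", "books"}, {"status", "active"}};
        for (String[] t : warmingTerms) {
          try {
            // Each search pulls index data into the caches before the new
            // searcher is registered.
            newSearcher.search(new TermQuery(new Term(t[0], t[1])), 10);
          } catch (IOException e) {
            // Warming is best-effort; ignore individual query failures.
          }
        }
      }

      @Override
      public void postCommit() {}

      @Override
      public void postSoftCommit() {}
    }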
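And to make the ordering you describe concrete, here is a tiny self-contained model of the startup sequence as I understand it (plain Java with made-up names, not the actual ZkController/CoreContainer code): since the descriptor list is empty when the down-states step runs, the node goes live before any core is loaded.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the suspected bug: the down-states step iterates over
    // core descriptors that have not been discovered yet.
    public class StartupOrderSketch {
      // Empty at ZkController#init time; only filled by CoreContainer#load.
      static final List<String> coreDescriptors = new ArrayList<>();

      public static void main(String[] args) {
        // Step 1 (ZkController#init): supposed to publish "down" states and
        // wait for them, but the loop never runs because the list is empty.
        publishAndWaitForDownStates();

        // Step 2: the node is registered as live anyway.
        System.out.println("node marked live");

        // Step 3 (CoreContainer#load): cores are only discovered now.
        coreDescriptors.add("collection1_shard1_replica_n1");
        System.out.println("core loaded: " + coreDescriptors);

        // Between steps 2 and 3, a replica left "active" in ZK (e.g. after
        // a crash) can be routed traffic its core cannot serve yet.
      }

      static void publishAndWaitForDownStates() {
        for (String core : coreDescriptors) {
          System.out.println("publish down + wait for " + core); // never reached
        }
      }
    }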