Hello,

I have been revisiting an earlier investigation of ours into an unexpected
behaviour where a node was taking traffic for a replica that was not ready
to serve it. It seems to happen when the node is marked as live and the
replica is marked as active, while the corresponding core has not yet been
loaded on the node.

I looked at the code and, in theory, this should not happen, since
ZkController#init does the following: mark the node as down, wait for its
replicas to be marked as down, and only then register the node as live.
However, looking at the code of publishAndWaitForDownStates, I observed
that it waits for down states only for replicas associated with the cores
returned by CoreContainer#getCoreDescriptors, and that list is empty at
this point, since ZkController#init is called before cores are discovered
(which happens later in CoreContainer#load).
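
To make the ordering concrete, here is a minimal Java sketch of what I
think is happening. It is a toy model, not actual Solr code: the class
StartupOrderSketch, its methods and the core name are hypothetical
stand-ins for ZkController#init, publishAndWaitForDownStates and
CoreContainer#load, and the real wait logic is reduced to a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the startup ordering. Not Solr code.
public class StartupOrderSketch {

    // Stand-in for the descriptors returned by CoreContainer#getCoreDescriptors.
    static final List<String> coreDescriptors = new ArrayList<>();

    // Stand-in for publishAndWaitForDownStates: returns the number of
    // replicas it would have waited on.
    static int publishAndWaitForDownStates() {
        int waited = 0;
        for (String core : coreDescriptors) {
            // The real code would block here until the replica is seen as down.
            waited++;
        }
        return waited;
    }

    // Stand-in for ZkController#init: wait for down states, then go live.
    static int init() {
        int waited = publishAndWaitForDownStates();
        // The node would be registered as live at this point.
        return waited;
    }

    // Stand-in for CoreContainer#load: cores are only discovered here.
    static void load() {
        coreDescriptors.add("collection1_shard1_replica_n1"); // made-up name
    }

    public static void main(String[] args) {
        int waited = init(); // runs first, as in the startup sequence
        load();              // core discovery happens afterwards
        // prints "replicas waited on before going live: 0"
        System.out.println("replicas waited on before going live: " + waited);
    }
}
```

In this model the loop body never runs during init, because the list is
only populated by load afterwards. If the real code behaves the same way,
the wait is a no-op.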

It therefore seems to me that we effectively never wait for any replica to
be marked as down, and simply continue the startup sequence by marking the
node as live. The node *might* then take traffic for a short period of time
for a replica that is not ready (e.g., if the node previously crashed and
the replica stayed active).

As I am new to investigating this kind of issue in SolrCloud, I wanted to
share my findings and get feedback on whether they are correct (in which
case I'd be happy to contribute a bug fix), or whether I am missing
something.

Thank you,

Vincent Primault.
