Hello,

We encountered an interesting bug (?) while rolling out TLS to existing 
Solr Clouds running 9.8, and I'm curious if others have seen anything similar. 
I'll stress that our goal was to limit, and ideally eliminate, downtime during 
this process.

Here's a breakdown of the problem:

1. If the leader is restarted *not last*, leadership can potentially shift to a 
node that has not yet been converted to TLS and therefore lacks the necessary 
keystore and configuration.

2. Despite the new leader not being TLS-enabled, 
ShardLeaderElectionContextBase::runLeaderProcess is invoked. This function, in 
turn, calls zkStateReader::getBaseUrlForNodeName, which derives the base URL 
from the node name plus the cluster-wide urlScheme property. If urlScheme has 
already been flipped to https (indicating the cluster is moving to TLS), the 
base URL for the new leader is incorrectly changed from http to https (see the 
sketch after this list).

3. As the new leader is not actually configured for TLS, the collection enters 
a leaderless state, and the election result is not reflected in state.json.
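To make step 2 concrete, here is a rough paraphrase (from memory, not the exact 
Solr source) of how that base URL is derived. The node name only carries host, 
port and context path, so the scheme comes entirely from the cluster-wide 
urlScheme property, regardless of what the node is actually serving:

    // Paraphrased sketch of the getBaseUrlForNodeName derivation -- not the
    // exact Solr source. A node name like "10.0.0.5:8983_solr" holds host,
    // port and context path only; the scheme comes from the urlScheme
    // cluster property. (Uses java.net.URLDecoder and
    // java.nio.charset.StandardCharsets; error handling omitted.)
    static String baseUrlForNodeName(String nodeName, String urlScheme) {
      int sep = nodeName.indexOf('_');
      String hostAndPort = nodeName.substring(0, sep);
      String path = URLDecoder.decode(nodeName.substring(sep + 1), StandardCharsets.UTF_8);
      // Once urlScheme has been flipped to "https" cluster-wide, every node,
      // TLS-ready or not, gets an https base URL from this point on.
      return urlScheme + "://" + hostAndPort + (path.isEmpty() ? "" : "/" + path);
    }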

Analyzing this further, we discovered that ZkController::publish (invoked 
during reconnect/recovery) can also force an incorrect urlScheme update on a 
node that is not yet ready for TLS.

The issue is more likely to occur on clouds hosting multiple collections, where 
the leaders of different collections are spread across different nodes. This is 
because the scenario is *only* triggered when a collection's leader node is 
restarted before that collection's other nodes during the TLS roll-out. You 
can, of course, try to force leadership onto the same node (e.g. via the 
preferredLeader replica property, shown below), but this adds deployment 
complexity and doesn't cover leadership changing spontaneously.
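For reference, "forcing" leadership here means something along the lines of the 
documented Collections API calls below (collection/shard/replica names are made 
up for the example): mark a replica on the node you intend to restart last as 
preferred leader, then ask Solr to rebalance:

    /admin/collections?action=ADDREPLICAPROP&collection=coll1&shard=shard1&replica=core_node3&property=preferredLeader&property.value=true
    /admin/collections?action=REBALANCELEADERS&collection=coll1

Even then, nothing prevents leadership from moving again mid-rollout, which is 
why we looked for a fix in the election code itself.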

To address this issue, we have developed a patch that skips the base URL update 
during leader election (it relies on the previous base URL whenever possible). 
The ZkController::joinElection code path, conveniently invoked during startup, 
is then solely responsible for updating the base URL. In our testing, this 
seems to be the safest behavior during a rolling TLS deployment.
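To make the intent concrete, the change is conceptually along these lines (a 
simplified sketch of the idea, not the actual diff; resolveLeaderBaseUrl is a 
made-up helper name):

    // Sketch only -- not the actual patch. In the leader election path,
    // prefer the base URL the replica was already registered with, and only
    // fall back to recomputing it from the urlScheme cluster property when
    // no previous value exists (e.g. a brand-new replica).
    // (Uses org.apache.solr.common.cloud.Replica and ZkStateReader.)
    private String resolveLeaderBaseUrl(Replica previous, String nodeName, ZkStateReader zkStateReader) {
      if (previous != null && previous.getBaseUrl() != null) {
        return previous.getBaseUrl();  // keep what the node registered at startup
      }
      // Fallback: derive from node name + cluster-wide urlScheme, as today.
      return zkStateReader.getBaseUrlForNodeName(nodeName);
    }

The startup-time refresh is deliberately left out of the sketch: as described 
above, in our patch ZkController::joinElection remains the one place that 
updates the base URL, so a node that has genuinely switched to TLS still picks 
up its https base URL after its own restart.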

While our testing indicates this patch resolves the immediate issue, it raises 
some broader questions:

1. Is our proposed patch a reasonable solution, or are we overlooking other 
critical factors?

2. Is it generally advisable to co-locate multiple collections on the same Solr 
processes? Beyond this limited corner case, we understand such a set-up may 
come with bigger potential problems. For instance, non-forward-compatible 
Lucene codec upgrades (e.g., older tlogs attempting to read segments written 
with newer codecs) can prove problematic.


Best,
Luke
