Hello,

We encountered an interesting bug (?) while rolling out TLS on existing Solr Clouds running 9.8, and I'm curious if others have seen anything similar. I'll stress that our goal was to limit, ideally eliminate, downtime during this process.
Here's a breakdown of the problem:

1. If the leader is restarted *not last*, leadership can shift to a node that has not yet been converted to TLS and therefore lacks the necessary keystore and configuration.
2. Despite the new leader not being TLS-enabled, ShardLeaderElectionContextBase::runLeaderProcess is invoked. This method in turn calls ZkStateReader::getBaseUrlForNodeName, which reads the cluster-wide urlScheme property. If urlScheme has already been updated to https (indicating the cluster is moving to TLS), the base URL for the new leader is incorrectly switched from http to https (the first sketch below my signature illustrates the mechanism).
3. Because the new leader is not actually configured for TLS, the collection enters a leaderless state, and the election result is not reflected in state.json.

Analyzing this further, we discovered that ZkController::publish (during reconnect/recovery) can also force an incorrect urlScheme update on a node that is not yet ready for TLS.

The issue is more likely to occur on clouds hosting multiple collections whose leaders are distributed across different nodes, because the scenario is *only* triggered when a collection's leader node is restarted before that collection's other nodes during the TLS roll-out. You can, of course, try to force leadership onto the same node, but that adds deployment complexity and doesn't cover leadership changing spontaneously.

To address this, we have developed a patch that skips the base URL update during leader election and relies on the previous base URL whenever possible. The ZkController::joinElection code path, which is conveniently invoked during startup, then becomes solely responsible for updating the base URL (the second sketch below outlines the intended behavior). From our testing, this seems to be the safest behavior during a rolling TLS deployment.

While our testing indicates the patch resolves the immediate issue, it raises some broader questions:

1. Is our proposed patch a reasonable solution, or are we overlooking other critical factors?
2. Is it generally advisable to co-locate multiple collections on the same Solr processes? Beyond this narrow corner case, we understand such a set-up may have bigger potential problems. For instance, non-forward-compatible Lucene codec upgrades (e.g., older tlogs attempting to read segments written with newer codecs) can prove problematic.

Best,
Luke
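
---

Sketch 1: a minimal, self-contained illustration of why the derived base URL goes wrong. This is not the actual Solr code; the class, method, and node name below are made up for illustration. The point is that a node name encodes host, port, and context but not the scheme, so the scheme comes from the cluster-wide urlScheme property regardless of what the individual node actually serves.

// Hypothetical helper mirroring the idea behind ZkStateReader::getBaseUrlForNodeName.
public final class BaseUrlSketch {

  static String baseUrlForNodeName(String nodeName, String clusterUrlScheme) {
    int idx = nodeName.indexOf('_');
    String hostAndPort = idx == -1 ? nodeName : nodeName.substring(0, idx);
    String context = idx == -1 ? "" : nodeName.substring(idx + 1).replace('_', '/');
    // The scheme is taken from the cluster property, not from the node itself.
    return clusterUrlScheme + "://" + hostAndPort + "/" + context;
  }

  public static void main(String[] args) {
    // Cluster property already flipped to https mid-rollout...
    String urlScheme = "https";
    // ...but this node has not been restarted with a keystore yet and still serves plain http.
    String plainHttpNode = "solr-2.example.com:8983_solr";

    // The derived base URL claims https even though the node cannot answer TLS requests;
    // this is what would end up in state.json if this node wins the election.
    System.out.println(baseUrlForNodeName(plainHttpNode, urlScheme));
    // -> https://solr-2.example.com:8983/solr
  }
}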
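Sketch 2: a rough outline of the behavior change our patch aims for (again hypothetical names, not the literal diff): during leader election the previously registered base URL wins whenever one exists, and only the startup path (the equivalent of ZkController::joinElection) recomputes it from the node name and urlScheme.

final class LeaderElectionUrlSketch {

  // previousBaseUrl stands in for whatever URL the replica last registered in cluster state.
  static String chooseLeaderBaseUrl(String previousBaseUrl,
                                    String nodeName,
                                    String clusterUrlScheme) {
    if (previousBaseUrl != null && !previousBaseUrl.isEmpty()) {
      // Rolling TLS deployment: the previous URL reflects what the node actually served
      // when it last registered, so reuse it rather than trusting the cluster-wide
      // urlScheme, which may be ahead of this particular node.
      return previousBaseUrl;
    }
    // No previous value (fresh replica): derive it the usual way from node name + scheme.
    return clusterUrlScheme + "://" + nodeName.replace('_', '/');
  }

  public static void main(String[] args) {
    // Mid-rollout: node still registered with http, cluster property already https.
    System.out.println(chooseLeaderBaseUrl(
        "http://solr-2.example.com:8983/solr",
        "solr-2.example.com:8983_solr",
        "https"));
    // -> http://solr-2.example.com:8983/solr (the URL the node actually serves)
  }
}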