Ishan, as I pointed out in Jira, I don’t care for you implying that I have evil 
intentions. I also resent your implication that I’m behaving irrationally or 
don’t care about the users. Those who are interested may read the comments 
in Jira and judge for themselves.

You conveniently don’t mention that I WITHDREW my objection, and instead 
proposed a lenient validation (but validation nonetheless!). It’s easy to 
scream “revert! revert!” but it actually takes some consideration to properly 
address the original purpose of this change - that is, detecting and avoiding 
the corruption of replica state. Let’s focus on this and not on pointing 
fingers.

As for the production outage - I’m sorry this happened to you, just as I hope 
you, Noble, and others are sorry for other inadvertently introduced bugs, which 
I’m sure brought down many clusters at inconvenient hours... 


> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <[email protected]> 
> wrote:
> 
> https://issues.apache.org/jira/browse/SOLR-14245 
> 
> There was a production outage at odd hours at my (and Noble's) client, due to 
> this above change in Solr 8.5 onwards by Andrzej Bialecki.
> 
> In short, there is some bug in Solr where a replica gets "null" as the 
> node_name (upon invocation of a collection API command). On the rare 
> occasions where we encountered such situations in the past, the replica would 
> be unavailable and the system would work fine overall. However, this change 
> (which introduces strict validation of errors while *reading* Replica 
> objects) now means that if such a situation arises (where one of Solr's own 
> APIs results in node_name being null in a state.json), all SolrJ clients 
> and all Solr nodes will go for a toss (possibly crash, and not start back up).
> 
> This change was rushed in, without any discussion or review, without 
> extensive testing for the failures it will cause on existing systems where 
> the cluster state is messed up but the system is running, and without any 
> consideration for the impact on users.
> 
> Noble and I are of the opinion that this change should be reverted 
> immediately, considering the impact to users. However, there is strong 
> disagreement on Andrzej's part.
> 
> Mistakes happen, but doubling down on them irrationally [1] will destroy the 
> reputation of the project, let alone the peace of mind of those who are 
> running Solr in production.
> 
> Does someone have any thoughts or opinions?
> 
> [1] - 
> https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758
