Thanks a ton for this contribution, Houston. I tried to work on this myself, but it seemed pretty complicated; I could only spot the issue in the ZkController, not the rest of the workflow, and I couldn't tell where the getReplicaNamesPerCollectionOnNode method should be implemented. I appreciate your time and effort. This PR was a great learning experience for me, and I look forward to this feature being merged.
This is definitely a major bug fix for large collections (> 50 shards) hosted on multiple nodes. Any node restart was causing at least 2-4 minutes of failed requests, which is not acceptable for any cluster serving hundreds of requests per second. The workaround was not feasible.

Thank you,
Rajani

On Tue, Apr 30, 2024 at 3:42 PM Houston Putman <hous...@apache.org> wrote:

> I've created a PR to address this:
> https://github.com/apache/solr/pull/2432
>
> Open to other ways of approaching it though.
>
> - Houston
>
> On Tue, Apr 30, 2024 at 4:44 AM Mark Miller <markrmil...@gmail.com> wrote:
>
> > There is a publish-node-as-down-and-wait method that just waits until the
> > down states show up in the cluster state. But waiting won't do any good
> > until down is actually published, and it still is not. I'm pretty sure down
> > has never been published on startup, despite appearances. I've seen two
> > ramifications from this. One is that it's much easier for replicas to get
> > out of sync when restarting a cluster while updates are coming in. I say
> > easier, because that's not airtight regardless, but it does make it much
> > easier to happen. The second is that cores that are not ready can
> > participate in leader elections and receive updates and queries. And if a
> > core fails to load, it will participate indefinitely. Retries and fault
> > tolerance will plaster over a good chunk of that, though, less so for
> > leader election when a core fails to load.
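(For anyone who, like me, found the startup workflow hard to trace: below is a rough, self-contained sketch of the ordering Mark describes. The names here are hypothetical, not Solr's actual API; it only illustrates the idea of "publish down first, wait until the down states are visible in cluster state, then register cores and join leader elections".)

    import java.util.concurrent.TimeUnit;

    // Hypothetical client interface, standing in for whatever talks to
    // the cluster state (ZooKeeper in Solr's case).
    interface ClusterStateClient {
        // Writes a DOWN state for every replica hosted on this node.
        void publishReplicasAsDown(String nodeName);
        // True once every replica on the node shows as DOWN in cluster state.
        boolean allReplicasShownAsDown(String nodeName);
        // Register cores and allow them to enter leader elections.
        void registerCoresAndJoinElections(String nodeName);
    }

    final class StartupSequence {
        static void start(ClusterStateClient client, String nodeName,
                          long timeoutMs) throws InterruptedException {
            // 1. Publish DOWN before anything else, so peers stop routing
            //    updates and queries to this node's replicas.
            client.publishReplicasAsDown(nodeName);

            // 2. Wait (bounded) for the DOWN states to actually appear in
            //    cluster state; waiting is pointless if nothing was published.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (!client.allReplicasShownAsDown(nodeName)) {
                if (System.nanoTime() > deadline) {
                    break; // give up waiting rather than hang startup forever
                }
                Thread.sleep(250);
            }

            // 3. Only now register cores and join leader elections, so a core
            //    that is not ready (or failed to load) does not become leader
            //    or receive traffic.
            client.registerCoresAndJoinElections(nodeName);
        }
    }

The point of the sketch is just the ordering: if step 1 never actually happens on startup (which is what this fix addresses), step 2 is a no-op and step 3 exposes not-yet-ready cores to elections and traffic.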