We've seen at least two cases "in the wild" where a Solr node is in fine shape but is NOT listed in live_nodes, even though the corresponding state.json shows that node as "active".
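To make the symptom concrete, here is a minimal sketch of how one might detect it. It assumes you have already fetched the children of the live_nodes znode and the text of a collection's state.json with whatever ZooKeeper client you use (that part is not shown). The function name and the sample data are made up for illustration.

```python
import json


def find_ghost_replicas(live_nodes, state_json_text):
    """Return (collection, shard, replica, node_name) tuples for replicas that
    state.json marks "active" but whose node is missing from live_nodes."""
    live = set(live_nodes)
    ghosts = []
    state = json.loads(state_json_text)
    for coll_name, coll in state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                node = replica.get("node_name")
                if replica.get("state") == "active" and node not in live:
                    ghosts.append((coll_name, shard_name, replica_name, node))
    return ghosts


# Hypothetical sample: host2 is "active" in state.json but absent from live_nodes.
SAMPLE_STATE = json.dumps({
    "test": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"node_name": "host1:8983_solr", "state": "active"},
                    "core_node2": {"node_name": "host2:8983_solr", "state": "active"},
                }
            }
        }
    }
})

if __name__ == "__main__":
    for ghost in find_ghost_replicas(["host1:8983_solr"], SAMPLE_STATE):
        print(ghost)
```

This only reports the mismatch; it does nothing to repair live_nodes.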
Furthermore, sending queries directly to the core on the machine in question with distrib=false generates a correct response, so Solr is indeed "live". AFAIK, there's no way to get that Solr node _back_ into live_nodes without bouncing the server, and that can be disruptive.

I've reproduced this situation locally and can confirm that the live_nodes entry never comes back. To reproduce it, though, I had to:
1> create a collection
2> nuke the live_nodes ZNode

Unfortunately, we don't know how to reproduce the original _real_ condition that caused this in the first place....

Other than manually editing the znode, is there any other way to reinsert the node in live_nodes? If not, what do people think about a Collections API command that did this? I'm thinking of a command that would fail unless it was sent to the node that was re-inserting itself. That way, if the node was truly down, it couldn't get re-inserted inappropriately.

Or Solr nodes could periodically query ZooKeeper to see whether they were appropriately in live_nodes, but that seems like a lot of work for something that's apparently _very_ rare. I'm also not sure the node in question is receiving events from ZK, so I don't think even having a node watch its own live_nodes entry is a foolproof way for a node that was inappropriately removed from live_nodes to re-insert itself.

*******************

Second issue. In extremely heavy indexing situations, replicas will never catch up to a leader if for some reason they go into recovery. The case in point is a situation that indexes over 1M docs/second. Of course, if all the replicas for a shard go down, everything grinds to a halt.

What do people think about an option to essentially toggle whether recoveries are even attempted? Yet another Collections API action perhaps, DISABLERECOVERY=true|false. Or maybe this is a property on the collection in ZK that you could change with MODIFYCOLLECTION and specify on CREATE. Actually, I like this latter a lot better than proliferating another API action.
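As a rough sketch of what the MODIFYCOLLECTION variant might look like from a client's point of view: the snippet below only builds the request URL. The property name disableRecovery is hypothetical (nothing like it exists today, it's just the idea being floated here), and the base URL and collection name are placeholders.

```python
from urllib.parse import urlencode


def modify_collection_url(base_url, collection, disable_recovery):
    """Build a Collections API MODIFYCOLLECTION URL that toggles the
    proposed (hypothetical) disableRecovery collection property."""
    params = {
        "action": "MODIFYCOLLECTION",
        "collection": collection,
        # Hypothetical property: not an existing Solr collection attribute.
        "disableRecovery": "true" if disable_recovery else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and collection name.
    print(modify_collection_url("http://localhost:8983/solr", "test", True))
```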
Yes, disabling recovery puts data integrity at risk, since eventually you get to a leader-only shard. But data integrity is already at risk, since the replicas demonstrably never catch up. Of course, the default would be to always do the recovery as we do now. Installations that see this happen periodically could change the option during an indexing lull, allow recovery, then change the property back.

I'm not entirely sure what I think of the idea, but again, this is something we're seeing in the wild.

I'll raise JIRAs unless the ideas get shot down.

Erick
