We've seen at least two cases "in the wild" where a Solr node is in
fine shape, but live_nodes does NOT list that node even though the
corresponding state.json shows it as "active".
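
To make the symptom concrete, here is a rough sketch of a check that
flags it: any node that hosts an "active" replica in state.json but
has no entry in live_nodes. It works on plain Python data rather than
a real ZooKeeper connection, and the function name, hosts, and
collection name are made up for illustration.

```python
def nodes_missing_from_live_nodes(cluster_state, live_nodes):
    """Return node names that host an "active" replica but are not in live_nodes.

    cluster_state: parsed state.json content (collection name -> collection data).
    live_nodes:    set of node names, i.e. the children of the live_nodes znode.
    """
    missing = set()
    for collection in cluster_state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                node = replica.get("node_name")
                if replica.get("state") == "active" and node not in live_nodes:
                    missing.add(node)
    return sorted(missing)


# Illustrative data only: host2 is "active" in state.json but not live.
state = {
    "test": {"shards": {"shard1": {"replicas": {
        "core_node1": {"node_name": "host1:8983_solr", "state": "active"},
        "core_node2": {"node_name": "host2:8983_solr", "state": "active"},
    }}}}
}
live = {"host1:8983_solr"}
print(nodes_missing_from_live_nodes(state, live))
```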

Furthermore, sending queries directly to the core on the machine in
question with distrib=false generates a correct response so Solr is
indeed "live". AFAIK, there's no way to get that Solr node _back_ into
live_nodes without bouncing the server, but that can be disruptive.

I've reproduced this situation locally and can confirm that the
live_nodes entry never comes back. To reproduce it though, I had to:
1> create a collection
2> nuke the live_nodes ZNode

Unfortunately, we don't know how to reproduce the original _real_
condition that caused this in the first place.

Other than manually editing the znode, is there any other way to
reinsert the node in live_nodes? If not, what do people think about a
Collections API that did this? I'm thinking of a command that would
fail unless it was sent to the node that was re-inserting itself. That
way if the node was truly down it couldn't get re-inserted
inappropriately.
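
The guard I have in mind could look roughly like the sketch below.
This is not an existing Solr API; reinsert_live_node and
ReinsertRefused are hypothetical names, and live_nodes is modelled as
a plain set instead of ZooKeeper znodes. The only point is the check
that the node receiving the request is the node being re-inserted.

```python
class ReinsertRefused(Exception):
    """Raised when the request reached a node other than the one named in it."""


def reinsert_live_node(receiving_node, requested_node, live_nodes):
    """Hypothetical handler: only a node itself may put itself back in live_nodes."""
    if receiving_node != requested_node:
        raise ReinsertRefused(
            f"{receiving_node} may not re-insert {requested_node}"
        )
    live_nodes.add(requested_node)  # no-op if the entry is already there
    return live_nodes


live = {"host1:8983_solr"}

# Sent to host2 itself: allowed, since host2 answered the request.
reinsert_live_node("host2:8983_solr", "host2:8983_solr", live)
print(sorted(live))

# Sent to host1 on behalf of host3: refused, host3 may really be down.
try:
    reinsert_live_node("host1:8983_solr", "host3:8983_solr", live)
except ReinsertRefused as e:
    print("refused:", e)
```

Since live_nodes entries are ephemeral znodes tied to the creating
node's ZK session, having the node re-create its own entry also keeps
that entry tied to the right session.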

Or Solr nodes could periodically query ZooKeeper to see if they were
appropriately in live_nodes, but that seems like a lot of work for
something that's apparently _very_ rare. I'm also not sure the node in
question is receiving events from ZK, so I don't think that having a
node watch its own live_nodes entry is a foolproof way for a node that
was inappropriately removed to re-insert itself.
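
A minimal sketch of the polling variant, assuming the node actively
reads live_nodes on a timer instead of relying on watch events.
ensure_registered and the two callables passed to it are hypothetical
stand-ins for whatever would really read and write ZooKeeper.

```python
def ensure_registered(my_node, read_live_nodes, register):
    """Run one check. Return True if a re-registration was attempted."""
    if my_node in read_live_nodes():
        return False          # entry present, nothing to do
    register(my_node)         # entry missing, put ourselves back
    return True


# In-memory stand-in for the live_nodes znode, for illustration only.
live = set()
print(ensure_registered("host1:8983_solr", lambda: live, live.add))
print(ensure_registered("host1:8983_solr", lambda: live, live.add))
```

A real version would call this from a scheduled task, and the
interval would be the trade-off between ZK load and how long a node
can stay missing.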


*******************
Second issue. In extremely heavy indexing situations, replicas will
never catch up to a leader if for some reason they go into recovery.
Of course if all the replicas for a shard go down, everything grinds
to a halt.

What do people think about an option to essentially toggle whether
recoveries are even attempted? Yet another Collections API perhaps,
DISABLERECOVERY=true|false. The case in point is a situation that
indexes over 1M docs/second. Or maybe this is a property on the
collection in ZK that you could change with MODIFYCOLLECTION and
specify on CREATE. Actually, I like the latter a lot better than
proliferating another API action.
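
For illustration, this is roughly what the request could look like if
it were a collection property. MODIFYCOLLECTION is an existing
Collections API action, but the disableRecovery property does not
exist today; that name, the helper function, and the collection name
are assumptions made up for this sketch.

```python
from urllib.parse import urlencode


def modify_collection_url(base_url, collection, disable_recovery):
    """Build a Collections API URL that would flip the proposed property."""
    params = {
        "action": "MODIFYCOLLECTION",
        "collection": collection,
        # HYPOTHETICAL property: not something Solr accepts today.
        "disableRecovery": "true" if disable_recovery else "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(modify_collection_url("http://localhost:8983/solr", "big_index", True))
```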

Yes, that puts data integrity at risk since eventually you get to a
leader-only shard. But that's already at risk since the replicas
demonstrably never catch up.

Of course the default state would be to always do the recovery as we
do now. Installations that see this happen periodically could change
the option during an indexing lull, allow recovery, then change the
property back.

Not entirely sure what I think of the idea at all, but again this is
something we're seeing in the wild.

I'll raise JIRAs unless the ideas get shot down.

Erick
