Both sound like kind of hacky workarounds to bugs to me.

For #1, we should not add APIs to fix up live_nodes. I'd try to reproduce
the real failure - manually removing the live node entry doesn't really
tell us anything. We need this to just work, not an API to try to patch
over a possible bug.

For #2, I think there is an open issue about taking a replica offline for
various reasons. Perhaps that could be used here. It should probably
disable any new recovery attempts as well as mark the replica inactive in
the cluster state until the replica is put back online.

- Mark

On Thu, Jan 14, 2016 at 4:23 PM Erick Erickson <[email protected]>
wrote:

> We've seen at least two cases "in the wild" where a Solr node is in
> fine shape, but live_nodes does NOT list it even though the
> corresponding state.json still shows it as "active".
>
> Furthermore, sending queries directly to the core on the machine in
> question with distrib=false generates a correct response so Solr is
> indeed "live". AFAIK, there's no way to get that Solr node _back_ into
> live_nodes without bouncing the server, but that can be disruptive.
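>
> Roughly, the kind of check involved - just a sketch, assuming SolrJ
> 5.x-era constructors; the ZK address, node name and core name are
> placeholders:
>
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.CloudSolrClient;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>
> public class LiveNodeCheck {
>   public static void main(String[] args) throws Exception {
>     // Does ZK think the node is live?
>     try (CloudSolrClient cloud = new CloudSolrClient("localhost:2181")) {
>       cloud.connect();
>       boolean listed = cloud.getZkStateReader().getClusterState()
>           .getLiveNodes().contains("host:8983_solr");
>       System.out.println("in live_nodes: " + listed);
>     }
>     // Ask the core directly, bypassing distributed search.
>     try (HttpSolrClient direct = new HttpSolrClient(
>         "http://host:8983/solr/collection1_shard1_replica1")) {
>       SolrQuery q = new SolrQuery("*:*");
>       q.set("distrib", false);   // distrib=false: answer from this core only
>       System.out.println("numFound: "
>           + direct.query(q).getResults().getNumFound());
>     }
>   }
> }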
>
> I've reproduced this situation locally and can confirm that the
> live_nodes entry never comes back. To reproduce it though, I had to:
> 1> create a collection
> 2> nuke the live_nodes ZNode
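>
> (For the record, "nuke the live_nodes ZNode" was something along these
> lines against a throwaway test cluster - raw ZooKeeper client shown just
> for illustration, zkCli.sh works equally well, and the connect string
> may need your chroot:)
>
> import org.apache.zookeeper.ZooKeeper;
>
> public class NukeLiveNodes {
>   public static void main(String[] args) throws Exception {
>     // Test cluster only: remove every live_nodes entry to simulate the loss.
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
>     for (String child : zk.getChildren("/live_nodes", false)) {
>       zk.delete("/live_nodes/" + child, -1);   // -1 = any version
>     }
>     zk.close();
>   }
> }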
>
> Unfortunately, we don't know how to reproduce the original _real_
> condition that caused this in the first place....
>
> Other than manually editing the znode, is there any other way to
> reinsert the node in live_nodes? If not, what do people think about a
> Collections API that did this? I'm thinking of a command that would
> fail unless it was sent to the node that was re-inserting itself. That
> way if the node was truly down it couldn't get re-inserted
> inappropriately.
>
> Or Solr nodes could periodically query ZooKeeper to see whether they are
> still listed in live_nodes, but that seems like a lot of work for
> something that's apparently _very_ rare. I'm also not sure the node in
> question is receiving events from ZK at all, so even watching its own
> live_nodes entry isn't a foolproof way for a node that was
> inappropriately removed to re-insert itself.
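>
> To make the "periodically check and re-insert" idea concrete, here's a
> very rough sketch (plain ZooKeeper client, placeholder node name; a real
> implementation would live inside Solr and would have to verify the ZK
> session is actually healthy before doing this):
>
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
>
> public class LiveNodeSelfCheck {
>   public static void main(String[] args) throws Exception {
>     String myNode = "/live_nodes/host:8983_solr";   // placeholder
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
>
>     Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
>       try {
>         if (zk.exists(myNode, false) == null) {
>           // Our entry is gone but we are clearly up: re-create the
>           // ephemeral znode. This is exactly the step that would need the
>           // "only the node itself may do this" guard discussed above.
>           zk.create(myNode, null, ZooDefs.Ids.OPEN_ACL_UNSAFE,
>               CreateMode.EPHEMERAL);
>         }
>       } catch (Exception e) {
>         e.printStackTrace();
>       }
>     }, 1, 1, TimeUnit.MINUTES);
>   }
> }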
>
>
> *******************
> Second issue. In extremely heavy indexing situations, replicas will
> never catch up to a leader if for some reason they go into recovery.
> Of course if all the replicas for a shard go down, everything grinds
> to a halt.
>
> What do people think about an option to essentially toggle whether
> recoveries are even attempted? Yet another Collections API perhaps,
> DISABLERECOVERY=true|false. The case in point is a situation that
> indexes over 1M docs/second. Or maybe this is a property on the
> collection in ZK that you could change with MODIFYCOLLECTION and
> specify on CREATE. Actually, I like the latter a lot better than
> adding yet another API action.
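>
> To be clear, disableRecovery below is entirely hypothetical - nothing
> like it exists today - but if it were a property stored on the
> collection, the check on the recovery path could be as simple as:
>
> import org.apache.solr.common.cloud.DocCollection;
> import org.apache.solr.common.cloud.ZkStateReader;
>
> public class RecoveryGate {
>   // zkStateReader is the node's existing ZkStateReader instance.
>   static boolean recoveryAllowed(ZkStateReader zkStateReader, String collection) {
>     DocCollection coll = zkStateReader.getClusterState().getCollection(collection);
>     // "disableRecovery" is made up for illustration; it would be set via
>     // MODIFYCOLLECTION (or at CREATE time) if the idea were implemented.
>     return !"true".equals(coll.getStr("disableRecovery"));
>   }
> }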
>
> Yes, that puts data integrity at risk since eventually you get to a
> leader-only shard. But that's already at risk since the replicas
> demonstrably never catch up.
>
> Of course the default would be to always attempt recovery, as we do
> now. Installations that saw this happen periodically could change the
> option during an indexing lull, allow recovery to complete, then change
> the property back.
>
> Not entirely sure what I think of the idea at all, but again this is
> something we're seeing in the wild.
>
> I'll raise JIRAs unless the ideas get shot down.
>
> Erick
>
- Mark
about.me/markrmiller
