[ https://issues.apache.org/jira/browse/SOLR-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jessica Cheng Mallet updated SOLR-8034:
---------------------------------------
Description:
If the minimum replication factor parameter (minRf) in a Solr update request is
not satisfied, i.e. the update was not successfully applied on at least minRf
replicas, the shard leader should not put the failed replicas into "leader
initiated recovery"; the client should retry the update instead.
This way, when minRf is not satisfied, the failed replicas remain eligible to
become leader if the leader fails, since from the client's perspective the
update did not succeed.
This came up in a network partition scenario where the leader was cut off from
its two followers, but all three could still talk to ZooKeeper. The partitioned
leader put its two followers into leader initiated recovery, so we couldn't
just kill off the partitioned node and have a follower take over leadership.
For minRf=1, this is the correct behavior because the partitioned leader would
have accepted updates that the followers don't have, and therefore we can't
switch leadership or we'd lose those updates. However, for minRf=2, Solr never
accepted any update from the client's point of view, so the partitioned leader
doesn't have any accepted update that the followers lack, and the followers
should therefore be eligible to become leader. Thus, I'm proposing modifying
the leader initiated recovery logic to not put the followers in recovery if
the minRf parameter is present and is not satisfied.
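
The proposed rule can be sketched roughly as follows. This is only an
illustration of the decision described above, not the attached patch and not
Solr's actual code: the class and method names are hypothetical, and the sketch
assumes the achieved replication factor counts the leader itself plus every
follower that acknowledged the update (which matches the minRf=1 and minRf=2
cases in the description).

```java
// Hypothetical helper, not part of Solr: models only the decision proposed here.
final class MinRfRecoveryPolicy {

    private MinRfRecoveryPolicy() {}

    /** Assumption: the leader counts as one copy, plus each follower that acked. */
    static int achievedRf(int followersAcked) {
        return 1 + followersAcked;
    }

    /**
     * Decides whether the leader should put followers that failed the update
     * into leader initiated recovery.
     *
     * @param requestedMinRf the minRf sent with the update, or null if absent
     * @param achievedRf     the replication factor the update actually reached
     */
    static boolean shouldPutFailedReplicasInRecovery(Integer requestedMinRf,
                                                     int achievedRf) {
        if (requestedMinRf == null) {
            // No minRf on the request: keep the existing behavior.
            return true;
        }
        // minRf present: only start recovery when the update met minRf, i.e.
        // when the client treats it as accepted. Otherwise the client retries
        // and the followers stay eligible for leadership.
        return achievedRf >= requestedMinRf;
    }

    public static void main(String[] args) {
        // Partition scenario: the leader cannot reach either of its two followers.
        int rf = achievedRf(0);
        System.out.println("minRf=1 -> recovery? "
                + shouldPutFailedReplicasInRecovery(1, rf));
        System.out.println("minRf=2 -> recovery? "
                + shouldPutFailedReplicasInRecovery(2, rf));
    }
}
```

With minRf=1 the partitioned leader has accepted the update on its own, so the
followers still go into recovery; with minRf=2 the update was never accepted
from the client's point of view, so they do not.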
> If minRF is not satisfied, leader should not put replicas in recovery
> ---------------------------------------------------------------------
>
> Key: SOLR-8034
> URL: https://issues.apache.org/jira/browse/SOLR-8034
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Jessica Cheng Mallet
> Labels: solrcloud
> Attachments: SOLR-8034.patch