[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037942#comment-14037942 ]

Mark Miller commented on SOLR-5468:
-----------------------------------

Moved my last (now gone) comment to where it was meant for: SOLR-5495

> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-5468
>                 URL: https://issues.apache.org/jira/browse/SOLR-5468
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.5
>         Environment: All
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Minor
>         Attachments: SOLR-5468.patch, SOLR-5468.patch, SOLR-5468.patch
>
> I've been thinking about how SolrCloud deals with write-availability using in-sync replica sets, in which writes will continue to be accepted so long as there is at least one healthy node per shard.
> For a little background (and to verify my understanding of the process is correct), SolrCloud only considers active/healthy replicas when acknowledging a write. Specifically, when a shard leader accepts an update request, it forwards the request to all active/healthy replicas and only considers the write successful if all active/healthy replicas ack the write. Any down/gone replicas are not considered and will sync up with the leader when they come back online using peer sync or snapshot replication. For instance, if a shard has 3 nodes, A, B, C, with A being the current leader, then writes to the shard will continue to succeed even if B & C are down.
> The issue is that if a shard leader continues to accept updates even after it loses all of its replicas, then we have acknowledged updates on only 1 node. If that node, call it A, then fails and one of the previous replicas, call it B, comes back online before A does, then any writes that A accepted while the other replicas were offline are at risk of being lost.
> SolrCloud does provide a safeguard mechanism for this problem with the leaderVoteWait setting, which puts any replicas that come back online before node A into a temporary wait state. If A comes back online within the wait period, then all is well, as it will become the leader again and no writes will be lost. As a side note, sys admins definitely need to be made more aware of this situation; when I first encountered it in my cluster, I had no idea what it meant.
> My question is whether we want to consider an approach where SolrCloud will not accept writes unless there is a majority of replicas available to accept the write? For my example, under this approach, we wouldn't accept writes if both B & C failed, but would if only C did, leaving A & B online. Admittedly, this lowers the write-availability of the system, so it may be something that should be tunable?
> From Mark M: Yeah, this is kind of like one of many little features that we have just not gotten to yet. I've always planned for a param that lets you say how many replicas an update must be verified on before responding success. Seems to make sense to fail that type of request early if you notice there are not enough replicas up to satisfy the param to begin with.
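The majority-quorum rule the description proposes is easy to state: with a replica set of size N (leader included), reject an update unless at least floor(N/2) + 1 replicas are active. A minimal sketch of that early check follows; the names (QuorumCheck, ensureQuorum) are hypothetical and this is not the patch's actual code, just the arithmetic the description argues for.

    /** Illustrative only: an early quorum check a leader could run before
     *  accepting an update. Names are hypothetical, not Solr's API. */
    public class QuorumCheck {

      /** Majority of the shard's full replica set, counting the leader. */
      static int majority(int replicationFactor) {
        return (replicationFactor / 2) + 1;
      }

      /**
       * Fail fast if too few replicas are active to reach a quorum.
       * @param replicationFactor total replicas configured for the shard
       * @param activeReplicas    replicas currently active, including the leader
       */
      static void ensureQuorum(int replicationFactor, int activeReplicas) {
        int required = majority(replicationFactor);
        if (activeReplicas < required) {
          throw new IllegalStateException(
              "Rejecting update: only " + activeReplicas + " of "
                  + replicationFactor + " replicas active; need " + required);
        }
      }

      public static void main(String[] args) {
        ensureQuorum(3, 2);     // A & B up, C down -> write accepted
        try {
          ensureQuorum(3, 1);   // only A up -> write rejected
        } catch (IllegalStateException e) {
          System.out.println(e.getMessage());
        }
      }
    }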
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037937#comment-14037937 ]

Mark Miller commented on SOLR-5468:
-----------------------------------

What's the current status of this? I see a lot of commits, but it's still in progress. I actually still hope to review it, but who knows. I have not touched anything in like a month or more now, so I expect a huge backlog of crap will stand before me.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007543#comment-14007543 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1597157 from [~thelabdude] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1597157 ]

Move SOLR-5468 to new features section
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007541#comment-14007541 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1597156 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1597156 ]

Move SOLR-5468 to new features section.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007531#comment-14007531 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1597149 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1597149 ]

SOLR-5468: Now in 4.9
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006120#comment-14006120 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596918 from [~thelabdude] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1596918 ]

SOLR-5468: add a short wait after healing the partitions for state to propagate, to address intermittent Jenkins failures.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006105#comment-14006105 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596916 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1596916 ]

SOLR-5468: Add a little wait for state to propagate after healing partitions; to address intermittent Jenkins failures.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005319#comment-14005319 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596703 from [~thelabdude] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1596703 ]

SOLR-5468: report replication factor that was achieved for an update request if requested by the client application; port from trunk
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005099#comment-14005099 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596652 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1596652 ]

SOLR-5468: Improve reporting of cluster state when assertions fail; to help diagnose cause of Jenkins failures.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003639#comment-14003639 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596315 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1596315 ]

SOLR-5495: Re-arrange location of SOLR-5495 and SOLR-5468 in CHANGES.txt
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003206#comment-14003206 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596234 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1596234 ]

SOLR-5468: Add wait loop to see replicas become active after restoring partitions; to address intermittent Jenkins test failures.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002546#comment-14002546 ]

ASF subversion and git services commented on SOLR-5468:
--------------------------------------------------------

Commit 1596092 from [~thelabdude] in branch 'dev/trunk'
[ https://svn.apache.org/r1596092 ]

SOLR-5468: report replication factor that was achieved for an update request if requested by the client application.
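This commit is the client-facing half of the feature: a client asks for a minimum replication factor on an update and reads back the factor actually achieved. A minimal SolrJ (4.x) sketch of that flow follows; the request param name "min_rf" and response-header key "rf" are assumed from the patch and should be verified against the committed code and tests.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MinRfExample {
      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:2181"); // ZK host
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("min_rf", "2"); // ask Solr to report achieved replication

        UpdateResponse rsp = req.process(server);
        // "rf" in the response header is the achieved replication factor
        // (assumed key name; verify against the committed patch).
        Object rf = rsp.getResponseHeader().get("rf");
        if (rf != null && Integer.parseInt(rf.toString()) < 2) {
          // Update is durable on fewer replicas than requested; the client
          // may choose to retry once the cluster heals.
          System.err.println("achieved rf=" + rf + ", below requested min_rf=2");
        }
        server.shutdown();
      }
    }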
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998379#comment-13998379 ]

Timothy Potter commented on SOLR-5468:
--------------------------------------

Hoping to commit this in the next couple of days. Any feedback before then would be appreciated.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990164#comment-13990164 ]

Mark Miller commented on SOLR-5468:
-----------------------------------

No, hinted handoff is for when a shard is not being served. It's less of an interest of mine because auto replica failover with HDFS should be a much better solution for that.
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990144#comment-13990144 ]

Otis Gospodnetic commented on SOLR-5468:
----------------------------------------

I just skimmed the comments here the other day. I could be wrong, but aren't you guys describing Hinted Handoff? If so, haven't applications like Voldemort and Cassandra and maybe others already dealt with this, and might they have code, or at least approaches that have been in production for a while, that could be followed/used? Maybe ES deals with this too, though I can't recall at the moment. [~gro] do you know?
[jira] [Commented] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989724#comment-13989724 ] Mark Miller commented on SOLR-5468: --- I won't be able to catch up to this in a competent for a week or two but a quick comment: First, this is def a hard problem. It's something I have thought a bit about, but have not cracked yet. bq. However, that's exactly what happens automatically I'm not sure 'exactly' fits. I think it's probably important to let the client know that an update only made it to one replica for example. That mean's you are in a position to possibly lose the update. One possibility is to define the fail the same way as a leader fail in the middle of your update. You don't know what happened in that case - the doc may be in or not. If we do the same here, the client will know, hey, this didn't make it to 2 or 3 replicas - it may be in the cluster, but we can't count on it's durability to the level we requested. The client can then choose how to handle this - accept what happened or take another action. Just spit balling at a conference, but I think there is a way to define the semantics here so that it's easier on us, but still gives the client the info they need to understand how durable that update was. This would not be the only case a fail does not mean the update is for sure not in the cluster - you can't get around that on fails of the leader in the middle of an update anyway. > Option to enforce a majority quorum approach to accepting updates in SolrCloud > -- > > Key: SOLR-5468 > URL: https://issues.apache.org/jira/browse/SOLR-5468 > Project: Solr > Issue Type: New Feature > Components: SolrCloud >Affects Versions: 4.5 > Environment: All >Reporter: Timothy Potter >Assignee: Timothy Potter >Priority: Minor > Attachments: SOLR-5468.patch > > > I've been thinking about how SolrCloud deals with write-availability using > in-sync replica sets, in which writes will continue to be accepted so long as > there is at least one healthy node per shard. > For a little background (and to verify my understanding of the process is > correct), SolrCloud only considers active/healthy replicas when acknowledging > a write. Specifically, when a shard leader accepts an update request, it > forwards the request to all active/healthy replicas and only considers the > write successful if all active/healthy replicas ack the write. Any down / > gone replicas are not considered and will sync up with the leader when they > come back online using peer sync or snapshot replication. For instance, if a > shard has 3 nodes, A, B, C with A being the current leader, then writes to > the shard will continue to succeed even if B & C are down. > The issue is that if a shard leader continues to accept updates even if it > loses all of its replicas, then we have acknowledged updates on only 1 node. > If that node, call it A, then fails and one of the previous replicas, call it > B, comes back online before A does, then any writes that A accepted while the > other replicas were offline are at risk to being lost. > SolrCloud does provide a safe-guard mechanism for this problem with the > leaderVoteWait setting, which puts any replicas that come back online before > node A into a temporary wait state. If A comes back online within the wait > period, then all is well as it will become the leader again and no writes > will be lost. 
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989664#comment-13989664 ] Timothy Potter commented on SOLR-5468:
---
So I've thought about this some more and am not sure it's even worth pursuing any further. Enabling this feature incurs a cost because it layers a synchronous blocking action (CountDownLatch.await) on top of an asynchronous process (replication from leader to N replicas).

So given there's a cost, what's the benefit? On the surface, it seems like a reasonable idea to tell a client application the level of replication that was achieved for an update request, but the best a client application can do is retry the update once the problem that caused degraded replication is resolved (and only if it's an idempotent update at that). However, that's exactly what happens automatically when a previously partitioned replica is healed. Specifically, the replica will be marked DOWN (see SOLR-5495) and then it must recover before becoming active, which gives the same result as the client application retrying a request.

Lastly, I thought more about Mark's point on failing fast when we know that a request cannot meet the desired RF requested by the client. In this one case, we won't have to worry about any backing out because we'd catch the problem before the local add/update on the leader. However, this sort of lulls the client into a false sense of security, in that we really can't "fail fast" if a replica goes down after the local add/update on the leader. So that would mean two different behaviors depending on timing, which I don't think we want.

Again, I think we should focus on hardening the out-of-sync replica recovery and leader failover scenarios.
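To make that cost concrete, here is a minimal sketch of what layering a blocking wait over the async fan-out looks like - all names here are hypothetical stand-ins, not Solr's actual update path: the leader submits the forward to each replica on a pool, counts acks, and blocks on a CountDownLatch until the requested number arrive or a timeout fires.

{noformat}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a synchronous wait layered over async replica fan-out.
public class MinReplicationSketch {

  // Forwards the update to each replica asynchronously, then blocks until
  // minAcks replicas have acked (or the timeout fires). Returns the count
  // actually achieved so the caller can report it to the client.
  static int forwardAndAwait(List<Runnable> replicaSends, int minAcks, long timeoutMs)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, replicaSends.size()));
    CountDownLatch latch = new CountDownLatch(minAcks);
    AtomicInteger acks = new AtomicInteger();
    try {
      for (Runnable send : replicaSends) {
        pool.submit(() -> {
          try {
            send.run();          // forward the update to one replica
            acks.incrementAndGet();
            latch.countDown();   // one more ack toward the threshold
          } catch (RuntimeException e) {
            // a failed replica never counts down; the await below may time out
          }
        });
      }
      // The synchronous blocking action layered on the asynchronous process:
      latch.await(timeoutMs, TimeUnit.MILLISECONDS);
      return acks.get();
    } finally {
      pool.shutdown();
    }
  }
}
{noformat}

Even in this toy form the cost is visible: every update now waits on the slowest of the first minAcks replicas to respond, which is exactly the overhead being weighed against the limited benefit above.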
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983153#comment-13983153 ] Mark Miller commented on SOLR-5468:
---
bq. Do you think this should be an update request parameter or a collection-level setting?

Yeah, I think it's common to allow passing this per request so the client can vary it depending on the data. Configurable defaults are probably worth looking at too, though.
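For illustration, the per-request shape could look like the SolrJ sketch below. The parameter name "min_rf" is an assumption here - whatever name the patch settles on would replace it - while UpdateRequest.setParam and process are standard SolrJ.

{noformat}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical usage sketch: asking for a minimum replication factor per update.
public class MinRfExample {

  static void indexWithMinRf(SolrClient client, String collection) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");

    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam("min_rf", "2"); // assumed param name: require acks from 2 replicas
    req.process(client, collection);
  }
}
{noformat}

A bulk-load job would simply omit the parameter and keep the current fire-and-forget behavior, while a durability-sensitive writer opts in per request.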
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983138#comment-13983138 ] Timothy Potter commented on SOLR-5468:
---
Thanks for the quick feedback. Do you think this should be an update request parameter or a collection-level setting? Just re-read your original comment about this, and it sounds like you were thinking of a parameter with each request. I like that, since it gives the option to bypass this checking when doing large bulk loads of the collection and only apply it when it makes sense.

In terms of fine-grained error response handling, it looks like this is captured in: https://issues.apache.org/jira/browse/SOLR-3382
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983035#comment-13983035 ] Mark Miller commented on SOLR-5468:
---
bq. for now it seems sufficient to let users decide how many replicas a write must succeed on to be considered successful.

I agree that that is the low-hanging fruit. We just have to let the user know exactly what we are trying to promise.

bq. there would need to be some "backing out" work to remove an update that succeeded on the leader but failed on the replicas.

Yup - that will be the hardest part of doing this the way we would really like, and a large reason it was punted on in all the initial work. Even if the leader didn't process the doc first (which is likely a doable optimization at some point), I still think it's really hard.

bq. Lastly, batches! What happens if half of a batch (sent by a client) succeeds and the other half fails (due to losing a replica in the middle of processing the batch)?

Batches and streaming really don't make sense yet in SolrCloud other than for batch loading. We need to implement better, fine-grained responses first. When that happens, it should all operate the same as a single update per request.
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983016#comment-13983016 ] Timothy Potter commented on SOLR-5468:
---
Starting to work on this ...

First, I think "majority quorum" is too strong for what we really need at the moment; for now it seems sufficient to let users decide how many replicas a write must succeed on to be considered successful. In other words, we can introduce a new, optional integer property when creating a collection - minActiveReplicas (need a better name) - which defaults to 1 (the current behavior). If > 1, then an update won't succeed unless it is ack'd by at least that many replicas. Activating this feature doesn't make much sense unless a collection has RF > 2.

The biggest hurdle to adding this behavior is the asynchronous, streaming-based approach leaders use to forward updates on to replicas. The current implementation uses a callback error handler to deal with failed update requests (from leader to replica) and simply considers an update successful if it works on the leader. Part of the complexity is that the leader processes the update before even attempting to forward it on to the replicas, so there would need to be some "backing out" work to remove an update that succeeded on the leader but failed on the replicas. This is starting to get messy ;-)

Another key point here is that this feature simply moves the problem from the Solr server to the client application, i.e. it's a fail-faster approach where a client indexing app gets notified that writes are not succeeding on enough replicas to meet the desired threshold. The client application still has to decide what to do when writes fail.

Lastly, batches! What happens if half of a batch (sent by a client) succeeds and the other half fails (due to losing a replica in the middle of processing the batch)?

Another idea I had is that maybe this isn't a collection-level property; maybe it should be set on a per-request basis?
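As a strawman for the fail-fast side of this, the check below rejects an update up front when the shard can't possibly reach the threshold - the one case with no "backing out" to worry about, since it runs before the leader's local add/update. Everything here is a hypothetical stand-in (a plain enum rather than Solr's cluster state), just to pin down the intended semantics.

{noformat}
import java.util.List;

// Hypothetical fail-fast sketch: reject before the leader does any work.
public class FailFastCheck {

  enum ReplicaState { ACTIVE, RECOVERING, DOWN }

  // Throws if too few replicas are active to ever satisfy minActiveReplicas.
  static void assertEnoughActive(List<ReplicaState> shardReplicas, int minActiveReplicas) {
    long active = shardReplicas.stream()
        .filter(s -> s == ReplicaState.ACTIVE)
        .count();
    if (active < minActiveReplicas) {
      throw new IllegalStateException("rejecting update: only " + active
          + " active replica(s), need " + minActiveReplicas);
    }
  }
}
{noformat}

Note the limitation raised earlier in the thread still applies: this check only helps before the local add, so a replica lost mid-request cannot "fail fast" and the two timings behave differently.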
[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830833#comment-13830833 ] Mark Miller commented on SOLR-5468:
---
{quote}Specifically, when a shard leader accepts an update request, it forwards the request to all active/healthy replicas and only considers the write successful if all active/healthy replicas ack the write.{quote}

In fact, because we add locally first, we treat a remote fail as meaning the remote node has to recover, and so we consider the write a success as far as the user is concerned. Internally, we also try to force a recovery on the remote node that failed: we presume it is out of sync and has to recover, and we try to ensure that by telling it to recover in case it's still up. I want to add more to this; see SOLR-5495. SOLR-4992 is also important. This JIRA issue is an important addition to the current design, though.

In the code, this is doc'd as:
{noformat}
// if its a forward, any fail is a problem -
// otherwise we assume things are fine if we got it locally
// until we start allowing min replication param
{noformat}
Of course, we don't just assume things are fine - we also try to put the node that rejected the update into recovery.
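A hedged sketch of that asymmetry (hypothetical helper names, not the actual distributed update processor code) may make the two paths clearer: a failed forward to the leader fails the request, while a failed forward from the leader to a replica is absorbed - the doc is already in locally, so the replica is told to recover and the client still sees success.

{noformat}
// Hypothetical sketch of the error handling described in the comment above.
public class LeaderForwardSketch {

  static void onRemoteFailure(boolean requestWasForwardToLeader, Runnable tellNodeToRecover) {
    if (requestWasForwardToLeader) {
      // if it's a forward, any fail is a problem: the leader never got the doc
      throw new RuntimeException("update failed on the leader; client sees an error");
    }
    // otherwise we assume things are fine since the leader added it locally...
    // ...but also ask the failed replica to recover in case it's still up
    tellNodeToRecover.run();
  }
}
{noformat}

Under a min-replication param, the second branch is exactly where the "assume things are fine" shortcut would have to change.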