[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877082#comment-14877082
 ] 

Mark Miller commented on SOLR-8069:
---

bq. If you believe that this is true, I do agree that your patch will 
accomplish the check that at the moment you're setting someone else down, 
you're the leader. 

If the leader cannot set a replica into LIR at any time for any reason, I think 
we have trouble in general.

I'm not sure I fully follow the rest. I can't wrap my head around LIR causing 
requests to fail or not...that doesn't make a lot of sense to me.

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




2015-09-18 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876205#comment-14876205
 ] 

Ramkumar Aiyengar commented on SOLR-8069:
-

That makes sense, but for my understanding, why is it a bad idea even if not 
everyone is participating?


2015-09-18 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876697#comment-14876697
 ] 

Ramkumar Aiyengar commented on SOLR-8069:
-

R1 was not in LIR, but it came up while R2 was still at the lead and decided to 
recover, before R2 stepped down due to being in LIR.


2015-09-18 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876133#comment-14876133
 ] 

Ramkumar Aiyengar commented on SOLR-8069:
-

Late to the party here. We experienced the same issue, and [~cpoerschke] was 
trying to create a test case for this. My initial thought was why we even check 
LIR when we are about to become the leader? Shouldn't the two-way sync cover 
us even if we are behind due to losing documents?


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876057#comment-14876057
 ] 

Mark Miller commented on SOLR-8069:
---

bq. predicate the ZK transaction on the election node 

As I think about this, I think I really prefer this method - with this, we use 
ZK to ensure *ONLY* the leader can put a replica into LIR. It doesn't matter 
what clumsy things happen elsewhere in the code: with this multi, only one 
replica in the shard, the leader as most recently enforced by ZK, will be able 
to put a replica into LIR. I like that property vs a multi on election 
nodes.




2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876236#comment-14876236
 ] 

Mark Miller commented on SOLR-8069:
---

I guess I worry about cases where a bad replica was marked as LIR by the leader 
and the shard goes down. It comes back with two nodes that were LIR but not the 
good replicas - do we want one of them to become the leader and lose data? We 
know they are probably not good actual leader candidates and the best way to 
prevent data loss is manual intervention if possible.
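The rule under discussion here can be sketched as a small decision function. This is only a toy model in Python, not Solr code: the names `may_ignore_lir` and `choose_leader` are hypothetical, and picking the first participant is a simplifying assumption standing in for the real election order and sync.

```python
def may_ignore_lir(all_replicas, participants):
    """True only when every known replica of the shard joined the election."""
    return set(all_replicas) <= set(participants)


def choose_leader(all_replicas, participants, lir):
    """Return a leader candidate, or None when manual intervention is safer.

    participants is assumed to be in election order; lir is the set of
    replicas currently marked for leader-initiated recovery.
    """
    # A participant that is not in LIR is always an acceptable candidate.
    healthy = [r for r in participants if r not in lir]
    if healthy:
        return healthy[0]
    # Everyone present is in LIR: only ignore the marker if nobody is missing,
    # because a missing replica might hold the most recent data.
    if participants and may_ignore_lir(all_replicas, participants):
        return participants[0]
    return None


shard = ["R1", "R2", "R3"]
# Full restart: all replicas participate, so LIR can be disregarded.
full_restart = choose_leader(shard, ["R2", "R1", "R3"], lir={"R1", "R2", "R3"})
# Partial restart: the good replica R1 is still down, so no leader is chosen.
partial_restart = choose_leader(shard, ["R2", "R3"], lir={"R2", "R3"})
```

In this model a partial restart with only LIR replicas yields no leader, which matches the worry above about promoting a stale replica and losing data.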


2015-09-18 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876156#comment-14876156
 ] 

Ramkumar Aiyengar commented on SOLR-8069:
-

The case we hit was when we cold stopped/started the cloud. This was on 4.10.4, 
so may not be valid now. Let's say you have R1 and R2.

* R1 is the leader and both R1 and R2 are stopped at the same time.
* R2 stops accepting requests but hasn't updated ZK yet; when R1 sends an 
update to R2, it fails and R1 puts R2 in LIR.
* R2 shuts down first, then R1.
* R1 starts up first, finds it should be the leader.
* R2 decides it should follow and tries to recover.
* R1 decides it can't be leader due to LIR and steps down. But by then R2 is in 
recovery, doesn't step up, and we have no one stepping forward.


2015-09-18 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876271#comment-14876271
 ] 

Ramkumar Aiyengar commented on SOLR-8069:
-

Got it. You have to include those in recovery along with the participants, since 
the ones which have gone into recovery are not going to help in any way (an 
alternative would be for them to abort recovery and rejoin if no one is 
around). But in your case, if I understand it right, detecting that there are 
down replicas (which might come back as good leaders) would certainly be a good 
idea.


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876184#comment-14876184
 ] 

Mark Miller commented on SOLR-8069:
---

bq. initial thought was why we even check LIR when we are about to become the 
leader?

I think if everyone participates in the election that makes sense. I've started 
working on that as a separate patch.

I still like the idea of making it so that by zk decree only the current leader 
can put a replica into LIR as one of two improvements.


2015-09-18 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876879#comment-14876879
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


bq. I think it does. A leader can do this. It doesn't matter if it had a valid 
reason to do it or not.
If you believe that this is true, I do agree that your patch will accomplish 
the check that at the moment you're setting someone else down, you're the 
leader. If we're going with this policy though, I think if at this moment it 
realizes that it's not the leader, it should actually fail the request because 
it shouldn't accept it on the real leader's behalf. E.g. if it's a node that 
was a leader but has just been network-partitioned off (but clusterstate change 
hasn't been made since it's asynchronous) and wasn't able to actually forward 
the request to the real leader.


2015-09-18 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876419#comment-14876419
 ] 

Timothy Potter commented on SOLR-8069:
--

Hi Ram,

In your scenario, why would R1 be in LIR? What put it there?


2015-09-18 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14847287#comment-14847287
 ] 

Shalin Shekhar Mangar commented on SOLR-8069:
-

bq. Hmmm, reading the code now I'm not sure it's doing exactly the right thing 
since it calls getLeaderSeqPath, which just takes the current ElectionContext 
from electionContexts, which isn't necessarily the one the node had when it 
decided to mark someone else down, right? Shalin Shekhar Mangar thoughts?

Right, we should acquire the leader sequence path at the beginning of the 
update instead of so late in the game. I believe Mark's patch has the same 
problem but it is somewhat diluted by checking against CloudDescriptor.isLeader.


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14852717#comment-14852717
 ] 

Mark Miller commented on SOLR-8069:
---

The difference is that this patch ensures we are still the leader when we get 
the context - rather than blindly getting the current context.

bq. is somewhat diluted 

I think it goes from still being a large hole to being really closed. Someone 
might have another idea for an improvement, but I don't see the scenario that 
really sneaks by this yet.

bq. My question is if it's absolutely safe for this node to set the other node 
in LiR simply because it's the leader now,

I think of course it is. It's valid for the leader and only the leader to set 
anyone as down.


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14866031#comment-14866031
 ] 

Mark Miller commented on SOLR-8069:
---

bq. thus I prefer the simple logic of "do this action only if our zookeeper 
session state is exactly what it was when we decided to do it". Anyhow, this is 
probably beyond the scope of this JIRA.

I don't see an easy way to do that in this case. Almost all the solutions that 
fit with the code have the exact same holes / races. I think the local leader 
check around getting the leader context is the strongest thing I can think of 
so far other than adding further defensive checks.

I don't know that much more is needed though. If the context returned is from 
the leader, great, its zkparentversion will match. If the context is 
somehow not the right one, it won't match. We get a context and only if it's 
the context for the leader in ZK do we do anything, rather than acting whenever 
the context merely has a node in line. I'd say that is a pretty strong 
improvement.

This should only work if the node is a valid leader by its local state and by 
ZooKeeper.
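The version check described above can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the patch itself: `ToyZk`, the znode paths, and `multi_check_then_create` are all hypothetical stand-ins, and the model simply treats "parent version" as a counter that changes whenever leadership is rewritten. In the real ZooKeeper Java client, a comparable atomic step would presumably be built from `Op.check` and `Op.create` inside a `multi` call.

```python
class BadVersion(Exception):
    """Raised when the guarded znode is not at the expected version."""


class ToyZk:
    """Minimal in-memory stand-in for ZooKeeper (illustration only)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def set(self, path, data):
        # Each write bumps the version; the first write yields version 0.
        _, version = self.nodes.get(path, (None, -1))
        self.nodes[path] = (data, version + 1)
        return version + 1

    def multi_check_then_create(self, check_path, expected_version, new_path, data):
        """Atomically verify check_path's version, then write new_path."""
        current = self.nodes.get(check_path)
        if current is None or current[1] != expected_version:
            raise BadVersion(check_path)
        self.set(new_path, data)


zk = ToyZk()
LEADER_PARENT = "/collections/c1/leaders/shard1"  # hypothetical path

# R1 wins the election and remembers the version in its election context.
r1_ctx_version = zk.set(LEADER_PARENT, "R1")

# R1 is still the leader according to ZK, so its LIR write is accepted.
zk.multi_check_then_create(LEADER_PARENT, r1_ctx_version, "/lir/shard1/R2", "down")

# Leadership moves to R2; the guarded znode's version changes.
zk.set(LEADER_PARENT, "R2")

# R1, still holding its stale context, tries to mark another replica down.
try:
    zk.multi_check_then_create(LEADER_PARENT, r1_ctx_version, "/lir/shard1/R3", "down")
    stale_write_applied = True
except BadVersion:
    stale_write_applied = False
```

The stale context fails the version check, so in this model only the replica that ZK currently regards as leader can create an LIR node.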


2015-09-18 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875931#comment-14875931
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


bq. I think of course it is. It's valid for the leader and only the leader to 
set anyone as down.

It's definitely only valid for the leader to set anyone down, but it doesn't 
mean that the leader should set someone down based on old leadership decision. 
This is the only place I'm unsure about.

bq. I don't see an easy way to do that in this case. Almost all the solutions 
that fit with the code have the exact same holes / races.

If we're willing to make more changes, one way I see this work is to write down 
the election node path as a prop in the leader znode (this is now written via 
zk transaction from your other commit). Then, have the isLeader logic in 
DistributedUpdateProcessor be based on reading the leader znode, and at that 
point record down the election node path as well. Then, when setting LiR, 
predicate the ZK transaction on the election node path read in the beginning of 
DistributedUpdateProcessor.
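The proposal above can be sketched with a toy in-memory model. Everything here is hypothetical and for illustration only: `FakeZk`, the paths, and the helpers `become_leader`, `begin_update` and `set_lir` are not Solr or ZooKeeper APIs. The sketch assumes the leader znode carries the election node path as a property and that the LiR write is guarded by the existence of the path captured at the start of the request.

```python
class FakeZk:
    """Minimal in-memory stand-in for ZooKeeper (illustration only)."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data=None):
        self.nodes[path] = data

    def delete(self, path):
        self.nodes.pop(path, None)

    def multi_exists_then_create(self, must_exist, new_path, data):
        """Atomically: create new_path only if must_exist is still present."""
        if must_exist not in self.nodes:
            return False
        self.nodes[new_path] = data
        return True


LEADER = "/shard1/leader"  # hypothetical leader znode


def become_leader(zk, name, election_path):
    # The election node path is stored as a property of the leader znode.
    zk.create(election_path)
    zk.create(LEADER, {"node": name, "election_path": election_path})


def begin_update(zk, name):
    """Leader check at the start of a request; returns the election path or None."""
    leader = zk.nodes.get(LEADER)
    if leader is None or leader["node"] != name:
        return None
    return leader["election_path"]


def set_lir(zk, captured_election_path, replica):
    # Predicate the LiR write on the election node seen at request start.
    return zk.multi_exists_then_create(
        captured_election_path, "/shard1/lir/" + replica, "down")


zk = FakeZk()
become_leader(zk, "R1", "/shard1/election/n_0001")
captured = begin_update(zk, "R1")
first = set_lir(zk, captured, "R2")   # R1 is still leader: accepted

# R1's session expires mid-request; its election node goes away and R3 leads.
zk.delete("/shard1/election/n_0001")
become_leader(zk, "R3", "/shard1/election/n_0002")
second = set_lir(zk, captured, "R3")  # stale leadership decision: rejected
```

Because the guard uses the path recorded when the request began, a node that has lost and later regained an election node cannot act on its old leadership decision in this model.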


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875960#comment-14875960
 ] 

Mark Miller commented on SOLR-8069:
---

bq. but it doesn't mean that the leader should set someone down based on old 
leadership decision.

I think it does. A leader can do this. It doesn't matter if it had a valid 
reason to do it or not.


2015-09-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876022#comment-14876022
 ] 

Mark Miller commented on SOLR-8069:
---

bq.  one way I see this 

That really seems the same as just getting the context earlier in the request.

Given the different ways LIR might be started and used, it really seemed 
simpler to try and localize the changes rather than tie them more into the 
request lifecycle.


2015-09-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803093#comment-14803093
 ] 

Mark Miller commented on SOLR-8069:
---

[~thelabdude], any immediate thoughts on this?


2015-09-17 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803172#comment-14803172
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


We have definitely seen this as well, even after commit for SOLR-7109 added 
zookeeper multi transaction to ZkController.markShardAsDownIfLeader, which is 
supposed to predicate setting the LiR node on the setter's still having the 
same election znode it thinks it has when it's a leader.

Hmmm, reading the code now I'm not sure it's doing exactly the right thing 
since it calls getLeaderSeqPath, which just takes the current ElectionContext 
from electionContexts, which isn't necessarily the one the node had when it 
decided to mark someone else down, right? [~shalinmangar] thoughts?

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803264#comment-14803264
 ] 

Timothy Potter commented on SOLR-8069:
--

Immediate thought is ugh! I'm surprised to hear this is still happening after 
7109. I'd like to dig in a bit more, but agreed on:

bq. It seems that if all the replicas participate in election on startup, LIR 
should just be cleared.



> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804446#comment-14804446
 ] 

Mark Miller commented on SOLR-8069:
---

bq. I still struggle with the safety of getting the ElectionContext from 
electionContexts, because what's mapped there could change from under this 
thread. 

That is why I check before and after we get the context that we locally think 
we are the leader. The idea is, if we locally are connected to zk and think we 
are leader before and after getting the latest context, we have near real 
confidence that we are the leader and can still do as we please.

There really is nothing tricky about the leader being advertised in 
clusterstate - it's simply slightly stale state that is updated by Overseer. I 
don't see how it complicates an approach to this?

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804438#comment-14804438
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


Actually, thinking about it -- why do we have the leader property in cluster 
state at all? If it's simply to publish leadership to solrj, it seems that on 
the server-side we should still use the leader znode as the "source of truth" 
so that we can have guarantees of consistent view along with the zk 
transactions. If solrj's view falls behind due to the asynchronous nature of 
having the Overseer update the state, at least on the server side we can check 
the leader znode.

Any historical reason why leadership information is in two places?

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804407#comment-14804407
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


I still struggle with the safety of getting the ElectionContext from 
electionContexts, because what's mapped there could change from under this 
thread. What about if we write down the election node path (e.g.  
238121947050958365-core_node2-n_06) into the leader znode as a leader 
prop, so that whenever we're actually checking that we're the leader, we can 
get that election node path back and do the zk multi checking for that 
particular election node path?

Ugh, but then I guess lots of places are actually looking at the cluster 
state's leader instead of the leader node. >_< Why are there separate places 
for marking the leader? I don't know how to reason about the asynchronous nature 
of cluster state's update wrt actual leader election...

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804599#comment-14804599
 ] 

Timothy Potter commented on SOLR-8069:
--

Quick pass over the patch looks good to me (a few unrelated changes in 
HdfsCollectionsAPIDistributedZkTest.java leaked into this patch). I'm focused 
on another unrelated issue at the moment so will take a closer look in the AM 
when I'm fresh, but I like the approach.

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804686#comment-14804686
 ] 

Mark Miller commented on SOLR-8069:
---

I think the thought game comes down to:

We check if we locally think we are the leader (which requires being connected to 
zk).

We get the current leader context.

We check again if we locally think we are the leader.

If all that passes, we assume we have context for when we were the leader. Now 
publishing only works if that same leader is registered.

So where are the holes?

There does not seem to be a lot of room to get the wrong context? In what 
scenario could we think we are the leader before and after the getContext call 
and end up with the wrong context?

And if we have the leader's context, the multi update ensures the update only 
happens if that context is still the leader.
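
The check, get, check sequence above can be sketched roughly as follows. This 
is an illustration under stated assumptions, not the code in the attached 
patch: `snapshot`, `thinksItIsLeader` and `currentContext` are hypothetical 
names, and the two suppliers stand in for whatever local leader check and 
context lookup the node really performs.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Illustrative only: the names below are hypothetical, not Solr's API.
class LeaderContextSnapshot {

    // Returns the context only if the local "am I the leader?" check passes
    // both before and after the context is read.
    static <C> Optional<C> snapshot(BooleanSupplier thinksItIsLeader,
                                    Supplier<C> currentContext) {
        if (!thinksItIsLeader.getAsBoolean()) {
            return Optional.empty();
        }
        C context = currentContext.get();
        if (context == null || !thinksItIsLeader.getAsBoolean()) {
            return Optional.empty();
        }
        return Optional.of(context);
    }

    public static void main(String[] args) {
        boolean[] leader = {true};

        // Leadership is stable across the read, so the context is returned.
        System.out.println(snapshot(() -> leader[0], () -> "ctx-1").isPresent()); // true

        // Leadership is lost while the context is being read, so it is rejected.
        Optional<String> lost = snapshot(() -> leader[0], () -> {
            leader[0] = false;
            return "ctx-2";
        });
        System.out.println(lost.isPresent()); // false
    }
}
```

The returned context would then be the one passed to the guarded multi update, 
so that the write only goes through if that same context is still the 
registered leader.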

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804677#comment-14804677
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


Yes, I think this is definitely an improvement. I'm just not sure if it gets 
everything covered. I suppose "we have near real confidence that we are the 
leader and can still do as we please" is probably good enough -- though I 
haven't convinced myself yet through playing with complex scenarios of repeated 
leadership changes -- thus I prefer the simple logic of "do this action only if 
our zookeeper session state is exactly what it was when we decided to do it". 
Anyhow, this is probably beyond the scope of this JIRA.

BTW, we tend to see this most when a "bad" query is issued (e.g. doing 
non-cursorMark deep paging of page 50,000). Presumably it triggers GC on each 
replica it hits (since the request is retried), and a series of leadership 
changes follows. With GC pauses complicating things, the states are quite 
difficult to reason through. 

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.






[jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.

2015-09-17 Thread Jessica Cheng Mallet (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804934#comment-14804934
 ] 

Jessica Cheng Mallet commented on SOLR-8069:


The scenario that I have in mind is this: suppose we're switching leadership 
back and forth due to nodes going into GC after receiving retries of an 
expensive query. A node is the leader at time T1 and decides to set another 
node in LiR, but goes into GC before it does, so it loses the leadership. Then 
the other node briefly gains leadership at T2 but also goes into GC and loses 
its leadership. Then the first node wakes up from GC and becomes the leader 
once more at T3, and only then does this code execute. My question is whether 
it's absolutely safe for this node to set the other node in LiR simply because 
it's the leader now, even though it decided to set the LiR when it was the 
leader at T1.
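
To make the T1/T2/T3 timeline concrete, here is a small model of it. It is a 
sketch, not Solr code: `LirTimeline` and its methods are hypothetical, and 
election znodes are reduced to strings with a sequence number. It only shows 
that a guard using the election path captured at decision time (T1) and a guard 
using the path looked up at write time (T3) give different answers; whether the 
second one is safe is exactly the open question.

```java
// Illustrative model only; class and method names are hypothetical.
class LirTimeline {
    private int nextSeq = 0;
    private String leaderElectionPath; // null while the shard has no leader

    // A node wins the election and gets a fresh election path.
    String becomeLeader(String node) {
        leaderElectionPath = node + "-n_" + nextSeq++;
        return leaderElectionPath;
    }

    void loseLeadership() { leaderElectionPath = null; }

    String currentLeaderPath() { return leaderElectionPath; }

    // Guarded write: succeeds only if the given path is still the leader's.
    boolean markDownIfLeader(String electionPath) {
        return electionPath != null && electionPath.equals(leaderElectionPath);
    }

    public static void main(String[] args) {
        LirTimeline shard = new LirTimeline();

        // T1: node1 is the leader and decides to put node2 into LiR.
        String pathAtDecision = shard.becomeLeader("node1");
        shard.loseLeadership(); // node1 pauses for GC before writing

        // T2: node2 leads briefly, then also pauses for GC.
        shard.becomeLeader("node2");
        shard.loseLeadership();

        // T3: node1 wakes up, wins again, and only now runs the write.
        shard.becomeLeader("node1");

        // Guarding on the path captured at T1 rejects the stale decision.
        System.out.println(shard.markDownIfLeader(pathAtDecision)); // false

        // Guarding on the path looked up at T3 lets it through.
        System.out.println(shard.markDownIfLeader(shard.currentLeaderPath())); // true
    }
}
```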

> Leader Initiated Recovery can put the replica with the latest data into LIR 
> and a shard will have no leader even on restart.
> 
>
> Key: SOLR-8069
> URL: https://issues.apache.org/jira/browse/SOLR-8069
> Project: Solr
>  Issue Type: Bug
>Reporter: Mark Miller
> Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation 
> where the rightful leader was put or put itself into LIR. Even on restart, 
> this rightful leader won't take leadership and you have to manually clear the 
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR 
> should just be cleared.


