> On April 13, 2016, 4:54 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java,
> >  line 80
> > <https://reviews.apache.org/r/46111/diff/1/?file=1341836#file1341836line80>
> >
> >     What would motivate us to do this?  Is there some race where the 
> > sequential ID from ZK might collide from 2 instances?

Its one of those sordid client corner cases that I don't think any of us ever 
even considered.
More info here:
  
http://curator.apache.org/apidocs/org/apache/curator/framework/api/CreateBuilderMain.html#withProtection--
  
The basic scenario is the client stays up but the server goes down with 
particularly bad timing during a node create.

Described in light of the leader election recipe here, but can cause different 
problems for other recipes:
1. client sends create ephemeral sequential (create is actually the only 
important detail here)
2. server fields request, commits node to a quorum, crashes before acking client
3. client sees a failed create and retries and succeeds

Since the client never went down, its heartbeats can have kept the session on 
the 1st node alive, but since the client doesn't know it owns that 1st node, if 
its the lowest number node in a leader election, no client will think they own 
the node, so all clients will wait on the effectively un-owned node and there 
will be no leader.  This is only fixed when the node is forcibly deleted or 
else this client finally expires its session.

In the context of the advertisement portion of a SingletonServer, we're 
probably OK without protection on the advertise nodes.  2 advertise nodes from 
the same leader will have the same endpoint data, so clients will find the 
server from either member of the 2+ member leader server set just fine.  When 
the leader is defeated its session will necessarily have timed out* and so both 
its leader and server set nodes will be wiped, cleaning up the dup server set 
node issue so that when the new leader gets elected there will not be a mixed 
server set.

*: However! If the leader nodes go away not from a session timeout but from 
forcible deletion, there could be a problem.  If the leader node were deleted 
only, then a new leader election would trigger and a new server could win, 
advertise, and now we'd have a serverset with 3 nodes, 2 pointing to the 
defeated leader, 1 to the new leader.


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46111/#review128788
-----------------------------------------------------------


On April 12, 2016, 3:21 p.m., John Sirois wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46111/
> -----------------------------------------------------------
> 
> (Updated April 12, 2016, 3:21 p.m.)
> 
> 
> Review request for Aurora, Bill Farner and Zameer Manji.
> 
> 
> Bugs: AURORA-1468
>     https://issues.apache.org/jira/browse/AURORA-1468
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> commons/src/main/java/org/apache/aurora/common/zookeeper/SingletonService.java
>             |   4 ++
>  
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  |   2 +-
>  
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>           | 191 +++++++++++++++++++++++++++++++++++++++++++++++++
>  
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>          | 114 +++++++++++++++++++++++++++++
>  
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitorTest.java
>    | 108 +++++-----------------------
>  
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>       | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 523 insertions(+), 90 deletions(-)
> 
> 
> Diffs
> -----
> 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/SingletonService.java
>  3561d07f65d11060e4a96c9df06af44107a73430 
>   
> commons/src/main/java/org/apache/aurora/common/zookeeper/testing/ZooKeeperTestServer.java
>  0ab24faadcba60ff8e68b2036562e4d39a4ac874 
>   
> src/main/java/org/apache/aurora/scheduler/discovery/CuratorSingletonService.java
>  PRE-CREATION 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/BaseCuratorDiscoveryTest.java
>  PRE-CREATION 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitorTest.java
>  559838984a519bdbbb9ed60a71cea726dcd5cb52 
>   
> src/test/java/org/apache/aurora/scheduler/discovery/CuratorSingletonServiceTest.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/46111/diff/
> 
> 
> Testing
> -------
> 
> Locally green: `./gradlew -Pq build`.
> 
> NB: The SingletonServiceImplTest has more coverage and code surrounding 
> illegal state transitions - double advertisement, double leave, advertise 
> after leave (enforcing single-use).  I can add those sorts of checks and test 
> for the checks, but thought I'd leave this 1st RB here to reduce clutter and 
> follow-up if folks want those sorts of checks.
> 
> 
> Thanks,
> 
> John Sirois
> 
>

Reply via email to