Re: Leader election logging during reconfiguration

2019-07-30 Thread Michael Han
>> we should measure the total time more accurately

+1 - it would be good to have a new metric to measure reconfiguration time,
and leaving existing LE time metric dedicated to measure the conventional
FLE time. Mixing both (as of today) will provide some confusing insights on
how long the conventional FLE actually took.

On Mon, Jul 29, 2019 at 7:13 PM Alexander Shraer  wrote:

> Please see comments inline.
>
> Thanks,
> Alex
>
> On Mon, Jul 29, 2019 at 5:29 PM Karolos Antoniadis 
> wrote:
>
> > Hi ZooKeeper developers,
> >
> > ZooKeeper seems to be logging a "*LEADER ELECTION TOOK*" message even
> > though no leader election takes place during a reconfiguration.
> >
> > This can be reproduced by following these steps:
> > 1) start a ZooKeeper cluster (e.g., 3 participants)
> > 2) start a client that connects to some follower
> > 3) perform a *reconfig* operation that removes the leader from the
> cluster
> >
> > After the reconfiguration takes place, we can see that the log files of
> the
> > remaining participants contain a "*LEADER ELECTION TOOK*" message. For
> > example, a line that contains
> >
> > *2019-07-29 23:07:38,518 [myid:2] - INFO
> >  [QuorumPeer[myid=2](plain=0.0.0.0:2792)(secure=disabled):Follower@75] -
> > FOLLOWING - LEADER ELECTION TOOK - 57 MS*
> >
> > However, no leader election took place. With that, I mean that no server
> > went *LOOKING *and then started voting and sending notifications to other
> > participants as would be in a normal leader election.
> > It seems, that before the *reconfig *is committed, the participant that
> is
> > going to be the next leader is already decided (see here:
> >
> >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L865
> > ).
> >
> > I think the above issue raises the following questions:
> > - Should we avoid logging LEADER ELECTION messages altogether in such
> > cases?
> >
>
> In the specific scenario you described, the leader has changed, but our
> heuristic for choosing the leader apparently worked and a new leader could
> be elected without running the full election.
> Notice that we could be unlucky and the designated leader could be offline,
> and then we'll fall back on election. It's useful to know how much time it
> takes to start following the new leader.
>
>
> > - Or, should there be some logging for the time it took for the
> > reconfiguration (e.g., the time between a participant gets a *reconfig*
> > operation till the operation is committed)? Would such a time value be
> > useful?
> >
>
> IIRC the LEADER ELECTION message is used for this purpose. if you just look
> on the time to commit the reconfig operation, you won't
> account for the work that happens when the commit message is received, such
> as leader re-election, role-change (follower->observer conversion and such)
> etc which is what takes most of the time.
> Committing a reconfig operation is usually not much more expensive than
> committing a normal operation. But perhaps you're right that we should
> measure the total time more accurately. Would you
> like to open a Jira and perhaps take a stab at improving this ?
>
> >
> > Best,
> > Karolos
> >
>


Re: Leader election logging during reconfiguration

2019-07-29 Thread Alexander Shraer
Please see comments inline.

Thanks,
Alex

On Mon, Jul 29, 2019 at 5:29 PM Karolos Antoniadis 
wrote:

> Hi ZooKeeper developers,
>
> ZooKeeper seems to be logging a "*LEADER ELECTION TOOK*" message even
> though no leader election takes place during a reconfiguration.
>
> This can be reproduced by following these steps:
> 1) start a ZooKeeper cluster (e.g., 3 participants)
> 2) start a client that connects to some follower
> 3) perform a *reconfig* operation that removes the leader from the cluster
>
> After the reconfiguration takes place, we can see that the log files of the
> remaining participants contain a "*LEADER ELECTION TOOK*" message. For
> example, a line that contains
>
> *2019-07-29 23:07:38,518 [myid:2] - INFO
>  [QuorumPeer[myid=2](plain=0.0.0.0:2792)(secure=disabled):Follower@75] -
> FOLLOWING - LEADER ELECTION TOOK - 57 MS*
>
> However, no leader election took place. With that, I mean that no server
> went *LOOKING *and then started voting and sending notifications to other
> participants as would be in a normal leader election.
> It seems, that before the *reconfig *is committed, the participant that is
> going to be the next leader is already decided (see here:
>
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L865
> ).
>
> I think the above issue raises the following questions:
> - Should we avoid logging LEADER ELECTION messages altogether in such
> cases?
>

In the specific scenario you described, the leader has changed, but our
heuristic for choosing the leader apparently worked and a new leader could
be elected without running the full election.
Notice that we could be unlucky and the designated leader could be offline,
and then we'll fall back on election. It's useful to know how much time it
takes to start following the new leader.


> - Or, should there be some logging for the time it took for the
> reconfiguration (e.g., the time between a participant gets a *reconfig*
> operation till the operation is committed)? Would such a time value be
> useful?
>

IIRC the LEADER ELECTION message is used for this purpose. if you just look
on the time to commit the reconfig operation, you won't
account for the work that happens when the commit message is received, such
as leader re-election, role-change (follower->observer conversion and such)
etc which is what takes most of the time.
Committing a reconfig operation is usually not much more expensive than
committing a normal operation. But perhaps you're right that we should
measure the total time more accurately. Would you
like to open a Jira and perhaps take a stab at improving this ?

>
> Best,
> Karolos
>


Re: Leader election

2018-12-12 Thread Michael Han
>> Can we reduce this time by configuring "syncLimit" and "tickTime" to
let's say 5 seconds? Can we have a strong
guarantee on this time bound?

It's not possible to guarantee the time bound, because of FLP impossibility
(reliable failure detection is not possible in async environment). Though
it's certainly possible to tune the parameters to some reasonable value
that fits your environment (which would be the SLA of your service).

>> As describe above - you might use 'sync'+'read' to avoid this problem.

I am afraid sync + read would not be correct 100% in all cases here. The
state of the world (e.g. leaders) could change between sync and read
operation. What we need here is linearizable read, which means we need have
read operations also go through the quorum consensus, which might be a nice
feature to have for ZooKeeper (for reference, etcd implements linearizable
read). Also, note ZooKeeper sync has bugs (sync should be a quorum
operation itself, but it's not implemented that way).

On Fri, Dec 7, 2018 at 2:11 AM Maciej Smoleński  wrote:

> On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich 
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
>
> This is related to the fact that ZooKeeper server handles reads from its
> local state - without communicating with other ZooKeeper servers.
> This design ensures scalability for read dominated workloads.
> In this approach client might receive data which is not up to date (it
> might not contain updates from other ZooKeeper servers (quorum)).
> Parameter 'syncLimit' describes how often ZooKeeper server
> synchronizes/updates its local state to global state.
> Client read operation will retrieve data from state not older then
> described by 'syncLimit'.
>
> However ZooKeeper client can always force to retrieve data which is up to
> date.
> It needs to issue 'sync' command to ZooKeeper server before issueing
> 'read'.
> With 'sync' ZooKeeper server with synchronize its local state with global
> state.
> Later 'read' will be handled from updated state.
> Client should be careful here - so that it communicates with the same
> ZooKeeper server for both 'sync' and 'read'.
>
>
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
>
> As describe above - you might use 'sync'+'read' to avoid this problem.
>
>
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jor...@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>


Re: Leader election

2018-12-11 Thread Michael Borokhovich
Thanks a lot for sharing the design, Ted. It is very helpful. Will check
what is applicable to our case and let you know in case of questions.

On Mon, Dec 10, 2018 at 23:37 Ted Dunning  wrote:

> One very useful way to deal with this is the method used in MapR FS. The
> idea is that ZK should only be used rarely and short periods of two leaders
> must be tolerated, but other data has to be written with absolute
> consistency.
>
> The method that we chose was to associate an epoch number with every write,
> require all writes to to all replicas and require that all replicas only
> acknowledge writes with their idea of the current epoch for an object.
>
> What happens in the even of partition is that we have a few possible cases,
> but in any case where data replicas are split by a partition, writes will
> fail triggering a new leader election. Only replicas on the side of the new
> ZK quorum (which may be the old quorum) have a chance of succeeding here.
> If the replicas are split away from the ZK quorum, writes may not be
> possible until the partition heals. If a new leader is elected, it will
> increment the epoch and form a replication chain out of the replicas it can
> find telling them about the new epoch. Writes can then proceed. During
> partition healing, any pending writes from the old epoch will be ignored by
> the current replicas. None of the writes to the new epoch will be directed
> to the old replicas after partition healing, but such writes should be
> ignored as well.
>
> In a side process, replicas that have come back after a partition may be
> updated with writes from the new replicas. If the partition lasts long
> enough, a new replica should be formed from the members of the current
> epoch. If a new replica is formed and an old one is resurrected, then the
> old one should probably be deprecated, although data balancing
> considerations may come into play.
>
> In the actual implementation of MapR FS, there is a lot of sophistication
> that does into the details, of course, and there is actually one more level
> of delegation that happens, but this outline is good enough for a lot of
> systems.
>
> The virtues of this system are multiple:
>
> 1) partition is detected exactly as soon as it affects a write. Detecting a
> partition sooner than that doesn't serve a lot of purpose, especially since
> the time to recover from a failed write is comparable to the duration of a
> fair number of partitions.
>
> 2) having an old master continue under false pretenses does no harm since
> it cannot write to a more recent replica chain. This is more important than
> it might seem since there can be situations where clocks don't necessarily
> advance at the expected rate so what seems like a short time can actually
> be much longer (Rip van Winkle failures).
>
> 3) forcing writes to all live replicas while allowing reorganization is
> actually very fast and as long as we can retain one live replica we can
> continue writing. This is in contrast to quorum systems where dropping
> below the quorum stops writes. This is important because the replicas of
> different objects can be arranged so that the portion of the cluster with a
> ZK quorum might not have a majority of replicas for some objects.
>
> 4) electing a new master of a replica chain can be done quite quickly so
> the duration of any degradation can be quite short (because you can set
> write timeouts fairly short because an unnecessary election takes less time
> than a long timeout)
>
> Anyway, you probably already have a design in mind. If this helps anyway,
> that's great.
>
> On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich  >
> wrote:
>
> > Makes sense. Thanks, Ted. We will design our system to cope with the
> short
> > periods where we might have two leaders.
> >
> > On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning 
> wrote:
> >
> > > ZK is able to guarantee that there is only one leader for the purposes
> of
> > > updating ZK data. That is because all commits have to originate with
> the
> > > current quorum leader and then be acknowledged by a quorum of the
> current
> > > cluster. IF the leader can't get enough acks, then it has de facto lost
> > > leadership.
> > >
> > > The problem comes when there is another system that depends on ZK data.
> > > Such data might record which node is the leader for some other
> purposes.
> > > That leader will only assume that they have become leader if they
> succeed
> > > in writing to ZK, but if there is a partition, then the old leader may
> > not
> > > be notified that another leader has established themselves until some
> > time
> > > after it has happened. Of course, if the erstwhile leader tried to
> > validate
> > > its position with a write to ZK, that write would fail, but you can't
> > spend
> > > 100% of your time doing that.
> > >
> > > it all comes down to the fact that a ZK client determines that it is
> > > connected to a ZK cluster member by pinging and that cluster member
> sees

Re: Leader election

2018-12-10 Thread Ted Dunning
One very useful way to deal with this is the method used in MapR FS. The
idea is that ZK should only be used rarely and short periods of two leaders
must be tolerated, but other data has to be written with absolute
consistency.

The method that we chose was to associate an epoch number with every write,
require all writes to to all replicas and require that all replicas only
acknowledge writes with their idea of the current epoch for an object.

What happens in the even of partition is that we have a few possible cases,
but in any case where data replicas are split by a partition, writes will
fail triggering a new leader election. Only replicas on the side of the new
ZK quorum (which may be the old quorum) have a chance of succeeding here.
If the replicas are split away from the ZK quorum, writes may not be
possible until the partition heals. If a new leader is elected, it will
increment the epoch and form a replication chain out of the replicas it can
find telling them about the new epoch. Writes can then proceed. During
partition healing, any pending writes from the old epoch will be ignored by
the current replicas. None of the writes to the new epoch will be directed
to the old replicas after partition healing, but such writes should be
ignored as well.

In a side process, replicas that have come back after a partition may be
updated with writes from the new replicas. If the partition lasts long
enough, a new replica should be formed from the members of the current
epoch. If a new replica is formed and an old one is resurrected, then the
old one should probably be deprecated, although data balancing
considerations may come into play.

In the actual implementation of MapR FS, there is a lot of sophistication
that does into the details, of course, and there is actually one more level
of delegation that happens, but this outline is good enough for a lot of
systems.

The virtues of this system are multiple:

1) partition is detected exactly as soon as it affects a write. Detecting a
partition sooner than that doesn't serve a lot of purpose, especially since
the time to recover from a failed write is comparable to the duration of a
fair number of partitions.

2) having an old master continue under false pretenses does no harm since
it cannot write to a more recent replica chain. This is more important than
it might seem since there can be situations where clocks don't necessarily
advance at the expected rate so what seems like a short time can actually
be much longer (Rip van Winkle failures).

3) forcing writes to all live replicas while allowing reorganization is
actually very fast and as long as we can retain one live replica we can
continue writing. This is in contrast to quorum systems where dropping
below the quorum stops writes. This is important because the replicas of
different objects can be arranged so that the portion of the cluster with a
ZK quorum might not have a majority of replicas for some objects.

4) electing a new master of a replica chain can be done quite quickly so
the duration of any degradation can be quite short (because you can set
write timeouts fairly short because an unnecessary election takes less time
than a long timeout)

Anyway, you probably already have a design in mind. If this helps anyway,
that's great.

On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich 
wrote:

> Makes sense. Thanks, Ted. We will design our system to cope with the short
> periods where we might have two leaders.
>
> On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning  wrote:
>
> > ZK is able to guarantee that there is only one leader for the purposes of
> > updating ZK data. That is because all commits have to originate with the
> > current quorum leader and then be acknowledged by a quorum of the current
> > cluster. IF the leader can't get enough acks, then it has de facto lost
> > leadership.
> >
> > The problem comes when there is another system that depends on ZK data.
> > Such data might record which node is the leader for some other purposes.
> > That leader will only assume that they have become leader if they succeed
> > in writing to ZK, but if there is a partition, then the old leader may
> not
> > be notified that another leader has established themselves until some
> time
> > after it has happened. Of course, if the erstwhile leader tried to
> validate
> > its position with a write to ZK, that write would fail, but you can't
> spend
> > 100% of your time doing that.
> >
> > it all comes down to the fact that a ZK client determines that it is
> > connected to a ZK cluster member by pinging and that cluster member sees
> > heartbeats from the leader. The fact is, though, that you can't tune
> these
> > pings to be faster than some level because you start to see lots of false
> > positives for loss of connection. Remember that it isn't just loss of
> > connection here that is the point. Any kind of delay would have the same
> > effect. Getting these ping intervals below one second makes for a 

Re: Leader election

2018-12-10 Thread Michael Borokhovich
Thanks, Maciej. That sounds good. We will try playing with the parameters
and have at least a known upper limit on the inconsistency interval.

On Fri, Dec 7, 2018 at 2:11 AM Maciej Smoleński  wrote:

> On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich 
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
>
> This is related to the fact that ZooKeeper server handles reads from its
> local state - without communicating with other ZooKeeper servers.
> This design ensures scalability for read dominated workloads.
> In this approach client might receive data which is not up to date (it
> might not contain updates from other ZooKeeper servers (quorum)).
> Parameter 'syncLimit' describes how often ZooKeeper server
> synchronizes/updates its local state to global state.
> Client read operation will retrieve data from state not older then
> described by 'syncLimit'.
>
> However ZooKeeper client can always force to retrieve data which is up to
> date.
> It needs to issue 'sync' command to ZooKeeper server before issueing
> 'read'.
> With 'sync' ZooKeeper server with synchronize its local state with global
> state.
> Later 'read' will be handled from updated state.
> Client should be careful here - so that it communicates with the same
> ZooKeeper server for both 'sync' and 'read'.
>
>
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
>
> As describe above - you might use 'sync'+'read' to avoid this problem.
>
>
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jor...@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>


Re: Leader election

2018-12-10 Thread Michael Borokhovich
Yes, I agree, our system should be able to tolerate two leaders for a short
and bounded period of time.
Thank you for the help!

On Thu, Dec 6, 2018 at 11:09 AM Jordan Zimmerman 
wrote:

> > it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself
>
> Yes - there are many ways in which you can end up with 2 leaders. However,
> if properly tuned and configured, it will be for a few seconds at most.
> During a GC pause no work is being done anyway. But, this stuff is very
> tricky. Requiring an atomically unique leader is actually a design smell
> and you should reconsider your architecture.
>
> > Maybe we can use a more
> > lightweight Hazelcast for example?
>
> There is no distributed system that can guarantee a single leader. Instead
> you need to adjust your design and algorithms to deal with this (using
> optimistic locking, etc.).
>
> -Jordan
>
> > On Dec 6, 2018, at 1:52 PM, Michael Borokhovich 
> wrote:
> >
> > Thanks Jordan,
> >
> > Yes, I will try Curator.
> > Also, beyond the problem described in the Tech Note, it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself. E.g., if a "leader" client is connected to the partitioned ZK
> node,
> > it may be notified not at the same time as the other clients connected to
> > the other ZK nodes. So, another client may take leadership while the
> > current leader still unaware of the change. Is it true?
> >
> > Another follow up question. If Zookeeper can guarantee a single leader,
> is
> > it worth using it just for leader election? Maybe we can use a more
> > lightweight Hazelcast for example?
> >
> > Michael.
> >
> >
> > On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <
> jor...@jordanzimmerman.com>
> > wrote:
> >
> >> It is not possible to achieve the level of consistency you're after in
> an
> >> eventually consistent system such as ZooKeeper. There will always be an
> >> edge case where two ZooKeeper clients will believe they are leaders
> (though
> >> for a short period of time). In terms of how it affects Apache Curator,
> we
> >> have this Tech Note on the subject:
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> >> description is true for any ZooKeeper client, not just Curator
> clients). If
> >> you do still intend to use a ZooKeeper lock/leader I suggest you try
> Apache
> >> Curator as writing these "recipes" is not trivial and have many gotchas
> >> that aren't obvious.
> >>
> >> -Jordan
> >>
> >> http://curator.apache.org 
> >>
> >>
> >>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich 
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> We have a service that runs on 3 hosts for high availability. However,
> at
> >>> any given time, exactly one instance must be active. So, we are
> thinking
> >> to
> >>> use Leader election using Zookeeper.
> >>> To this goal, on each service host we also start a ZK server, so we
> have
> >> a
> >>> 3-nodes ZK cluster and each service instance is a client to its
> dedicated
> >>> ZK server.
> >>> Then, we implement a leader election on top of Zookeeper using a basic
> >>> recipe:
> >>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection
> .
> >>>
> >>> I have the following questions doubts regarding the approach:
> >>>
> >>> 1. It seems like we can run into inconsistency issues when network
> >>> partition occurs. Zookeeper documentation says that the inconsistency
> >>> period may last “tens of seconds”. Am I understanding correctly that
> >> during
> >>> this time we may have 0 or 2 leaders?
> >>> 2. Is it possible to reduce this inconsistency time (let's say to 3
> >>> seconds) by tweaking tickTime and syncLimit parameters?
> >>> 3. Is there a way to guarantee exactly one leader all the time? Should
> we
> >>> implement a more complex leader election algorithm than the one
> suggested
> >>> in the recipe (using ephemeral_sequential nodes)?
> >>>
> >>> Thanks,
> >>> Michael.
> >>
> >>
>
>


Re: Leader election

2018-12-10 Thread Michael Borokhovich
Makes sense. Thanks, Ted. We will design our system to cope with the short
periods where we might have two leaders.

On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning  wrote:

> ZK is able to guarantee that there is only one leader for the purposes of
> updating ZK data. That is because all commits have to originate with the
> current quorum leader and then be acknowledged by a quorum of the current
> cluster. IF the leader can't get enough acks, then it has de facto lost
> leadership.
>
> The problem comes when there is another system that depends on ZK data.
> Such data might record which node is the leader for some other purposes.
> That leader will only assume that they have become leader if they succeed
> in writing to ZK, but if there is a partition, then the old leader may not
> be notified that another leader has established themselves until some time
> after it has happened. Of course, if the erstwhile leader tried to validate
> its position with a write to ZK, that write would fail, but you can't spend
> 100% of your time doing that.
>
> it all comes down to the fact that a ZK client determines that it is
> connected to a ZK cluster member by pinging and that cluster member sees
> heartbeats from the leader. The fact is, though, that you can't tune these
> pings to be faster than some level because you start to see lots of false
> positives for loss of connection. Remember that it isn't just loss of
> connection here that is the point. Any kind of delay would have the same
> effect. Getting these ping intervals below one second makes for a very
> twitchy system.
>
>
>
> On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich 
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jor...@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>


Re: Leader election

2018-12-07 Thread Maciej Smoleński
On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich 
wrote:

> We are planning to run Zookeeper nodes embedded with the client nodes.
> I.e., each client runs also a ZK node. So, network partition will
> disconnect a ZK node and not only the client.
> My concern is about the following statement from the ZK documentation:
>
> "Timeliness: The clients view of the system is guaranteed to be up-to-date
> within a certain time bound. (*On the order of tens of seconds.*) Either
> system changes will be seen by a client within this bound, or the client
> will detect a service outage."
>

This is related to the fact that ZooKeeper server handles reads from its
local state - without communicating with other ZooKeeper servers.
This design ensures scalability for read dominated workloads.
In this approach client might receive data which is not up to date (it
might not contain updates from other ZooKeeper servers (quorum)).
Parameter 'syncLimit' describes how often ZooKeeper server
synchronizes/updates its local state to global state.
Client read operation will retrieve data from state not older then
described by 'syncLimit'.

However ZooKeeper client can always force to retrieve data which is up to
date.
It needs to issue 'sync' command to ZooKeeper server before issueing 'read'.
With 'sync' ZooKeeper server with synchronize its local state with global
state.
Later 'read' will be handled from updated state.
Client should be careful here - so that it communicates with the same
ZooKeeper server for both 'sync' and 'read'.


> What are these "*tens of seconds*"? Can we reduce this time by configuring
> "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> guarantee on this time bound?
>

As describe above - you might use 'sync'+'read' to avoid this problem.


>
>
> On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> jor...@jordanzimmerman.com>
> wrote:
>
> > > Old service leader will detect network partition max 15 seconds after
> it
> > > happened.
> >
> > If the old service leader is in a very long GC it will not detect the
> > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > leaders for a short period of time.
> >
> > -JZ
>


Re: Leader election

2018-12-06 Thread Ted Dunning
ZK is able to guarantee that there is only one leader for the purposes of
updating ZK data. That is because all commits have to originate with the
current quorum leader and then be acknowledged by a quorum of the current
cluster. IF the leader can't get enough acks, then it has de facto lost
leadership.

The problem comes when there is another system that depends on ZK data.
Such data might record which node is the leader for some other purposes.
That leader will only assume that they have become leader if they succeed
in writing to ZK, but if there is a partition, then the old leader may not
be notified that another leader has established themselves until some time
after it has happened. Of course, if the erstwhile leader tried to validate
its position with a write to ZK, that write would fail, but you can't spend
100% of your time doing that.

it all comes down to the fact that a ZK client determines that it is
connected to a ZK cluster member by pinging and that cluster member sees
heartbeats from the leader. The fact is, though, that you can't tune these
pings to be faster than some level because you start to see lots of false
positives for loss of connection. Remember that it isn't just loss of
connection here that is the point. Any kind of delay would have the same
effect. Getting these ping intervals below one second makes for a very
twitchy system.



On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich 
wrote:

> We are planning to run Zookeeper nodes embedded with the client nodes.
> I.e., each client runs also a ZK node. So, network partition will
> disconnect a ZK node and not only the client.
> My concern is about the following statement from the ZK documentation:
>
> "Timeliness: The clients view of the system is guaranteed to be up-to-date
> within a certain time bound. (*On the order of tens of seconds.*) Either
> system changes will be seen by a client within this bound, or the client
> will detect a service outage."
>
> What are these "*tens of seconds*"? Can we reduce this time by configuring
> "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> guarantee on this time bound?
>
>
> On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> jor...@jordanzimmerman.com>
> wrote:
>
> > > Old service leader will detect network partition max 15 seconds after
> it
> > > happened.
> >
> > If the old service leader is in a very long GC it will not detect the
> > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > leaders for a short period of time.
> >
> > -JZ
>


Re: Leader election

2018-12-06 Thread Michael Borokhovich
We are planning to run Zookeeper nodes embedded with the client nodes.
I.e., each client runs also a ZK node. So, network partition will
disconnect a ZK node and not only the client.
My concern is about the following statement from the ZK documentation:

"Timeliness: The clients view of the system is guaranteed to be up-to-date
within a certain time bound. (*On the order of tens of seconds.*) Either
system changes will be seen by a client within this bound, or the client
will detect a service outage."

What are these "*tens of seconds*"? Can we reduce this time by configuring
"syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
guarantee on this time bound?


On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman 
wrote:

> > Old service leader will detect network partition max 15 seconds after it
> > happened.
>
> If the old service leader is in a very long GC it will not detect the
> partition. In the face of VM pauses, etc. it's not possible to avoid 2
> leaders for a short period of time.
>
> -JZ


Re: Leader election

2018-12-06 Thread Michael Han
Tweak timeout is tempting as your solution might work most of the time yet
fail in certain cases (which others have pointed out). If the goal is
absolute correctness then we should avoid timeout, which does not guarantee
correctness as it only makes the problem hard to manifest. Fencing is the
right solution here - the zxid and also znode cversion can be used as
fencing token if you use ZooKeeper. Fencing will guarantee at any single
point in time you will have one active leader in action (it does not
guarantee that at a single point of time there are multiple parties *think*
they are the leader). An alternative solution, depends on your use case, is
to instead of requiring a single active leader in action at any time, make
your workload idempotent so multiple active leaders don't do any damage.

On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman 
wrote:

> > Old service leader will detect network partition max 15 seconds after it
> > happened.
>
> If the old service leader is in a very long GC it will not detect the
> partition. In the face of VM pauses, etc. it's not possible to avoid 2
> leaders for a short period of time.
>
> -JZ


Re: Leader election

2018-12-06 Thread Jordan Zimmerman
> Old service leader will detect network partition max 15 seconds after it
> happened.

If the old service leader is in a very long GC it will not detect the 
partition. In the face of VM pauses, etc. it's not possible to avoid 2 leaders 
for a short period of time.

-JZ

Re: Leader election

2018-12-06 Thread Maciej Smoleński
Hello,

Ensuring reliability requires to use consensus directly in your service or
change the service to use distributed log/journal (e.g. bookkeeper).

However following idea is simple and in many situation good enough.
If you configure session timeout to 15 seconds - then zookeeper client will
be disconnected when partitioned - after max 15 seconds.
Old service leader will detect network partition max 15 seconds after it
happened.
The new service leader should be idle for initial 15+ seconds (let's say 30
seconds).
In this way you avoid situation with 2 concurrently working leaders.

Described solution has time dependencies and in some situations leads to
incorrect state e.g.:
High load on machine might cause that zookeeper client will detect
disconnection after 60 seconds (instead of expected 15 seconds). In such
situation there will be 2 concurrent leaders.

Maciej




On Thu, Dec 6, 2018 at 8:09 PM Jordan Zimmerman 
wrote:

> > it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself
>
> Yes - there are many ways in which you can end up with 2 leaders. However,
> if properly tuned and configured, it will be for a few seconds at most.
> During a GC pause no work is being done anyway. But, this stuff is very
> tricky. Requiring an atomically unique leader is actually a design smell
> and you should reconsider your architecture.
>
> > Maybe we can use a more
> > lightweight Hazelcast for example?
>
> There is no distributed system that can guarantee a single leader. Instead
> you need to adjust your design and algorithms to deal with this (using
> optimistic locking, etc.).
>
> -Jordan
>
> > On Dec 6, 2018, at 1:52 PM, Michael Borokhovich 
> wrote:
> >
> > Thanks Jordan,
> >
> > Yes, I will try Curator.
> > Also, beyond the problem described in the Tech Note, it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself. E.g., if a "leader" client is connected to the partitioned ZK
> node,
> > it may be notified not at the same time as the other clients connected to
> > the other ZK nodes. So, another client may take leadership while the
> > current leader still unaware of the change. Is it true?
> >
> > Another follow up question. If Zookeeper can guarantee a single leader,
> is
> > it worth using it just for leader election? Maybe we can use a more
> > lightweight Hazelcast for example?
> >
> > Michael.
> >
> >
> > On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <
> jor...@jordanzimmerman.com>
> > wrote:
> >
> >> It is not possible to achieve the level of consistency you're after in
> an
> >> eventually consistent system such as ZooKeeper. There will always be an
> >> edge case where two ZooKeeper clients will believe they are leaders
> (though
> >> for a short period of time). In terms of how it affects Apache Curator,
> we
> >> have this Tech Note on the subject:
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> >> description is true for any ZooKeeper client, not just Curator
> clients). If
> >> you do still intend to use a ZooKeeper lock/leader I suggest you try
> Apache
> >> Curator as writing these "recipes" is not trivial and have many gotchas
> >> that aren't obvious.
> >>
> >> -Jordan
> >>
> >> http://curator.apache.org 
> >>
> >>
> >>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich 
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> We have a service that runs on 3 hosts for high availability. However,
> at
> >>> any given time, exactly one instance must be active. So, we are
> thinking
> >> to
> >>> use Leader election using Zookeeper.
> >>> To this goal, on each service host we also start a ZK server, so we
> have
> >> a
> >>> 3-nodes ZK cluster and each service instance is a client to its
> dedicated
> >>> ZK server.
> >>> Then, we implement a leader election on top of Zookeeper using a basic
> >>> recipe:
> >>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection
> .
> >>>
> >>> I have the following questions doubts regarding the approach:
> >>>
> >>> 1. It seems like we can run into inconsistency issues when network
> >>> partition occurs. Zookeeper documentation says that the inconsistency
> >>> period may last “tens of seconds”. Am I understanding correctly that
> >> during
> >>> this time we may have 0 or 2 leaders?
> >>> 2. Is it possible to reduce this inconsistency time (let's say to 3
> >>> seconds) by tweaking tickTime and syncLimit parameters?
> >>> 3. Is there a way to guarantee exactly one leader all the time? Should
> we
> >>> implement a more complex leader election algorithm than the one
> suggested
> >>> in the recipe (using ephemeral_sequential nodes)?
> >>>
> >>> Thanks,
> >>> Michael.
> >>
> >>
>
>


Re: Leader election

2018-12-06 Thread Jordan Zimmerman
> it seems like the
> inconsistency may be caused by the partition of the Zookeeper cluster
> itself

Yes - there are many ways in which you can end up with 2 leaders. However, if 
properly tuned and configured, it will be for a few seconds at most. During a 
GC pause no work is being done anyway. But, this stuff is very tricky. 
Requiring an atomically unique leader is actually a design smell and you should 
reconsider your architecture.

> Maybe we can use a more
> lightweight Hazelcast for example?

There is no distributed system that can guarantee a single leader. Instead you 
need to adjust your design and algorithms to deal with this (using optimistic 
locking, etc.).

-Jordan

> On Dec 6, 2018, at 1:52 PM, Michael Borokhovich  wrote:
> 
> Thanks Jordan,
> 
> Yes, I will try Curator.
> Also, beyond the problem described in the Tech Note, it seems like the
> inconsistency may be caused by the partition of the Zookeeper cluster
> itself. E.g., if a "leader" client is connected to the partitioned ZK node,
> it may be notified not at the same time as the other clients connected to
> the other ZK nodes. So, another client may take leadership while the
> current leader still unaware of the change. Is it true?
> 
> Another follow up question. If Zookeeper can guarantee a single leader, is
> it worth using it just for leader election? Maybe we can use a more
> lightweight Hazelcast for example?
> 
> Michael.
> 
> 
> On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman 
> wrote:
> 
>> It is not possible to achieve the level of consistency you're after in an
>> eventually consistent system such as ZooKeeper. There will always be an
>> edge case where two ZooKeeper clients will believe they are leaders (though
>> for a short period of time). In terms of how it affects Apache Curator, we
>> have this Tech Note on the subject:
>> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
>> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
>> description is true for any ZooKeeper client, not just Curator clients). If
>> you do still intend to use a ZooKeeper lock/leader I suggest you try Apache
>> Curator as writing these "recipes" is not trivial and have many gotchas
>> that aren't obvious.
>> 
>> -Jordan
>> 
>> http://curator.apache.org 
>> 
>> 
>>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich 
>> wrote:
>>> 
>>> Hello,
>>> 
>>> We have a service that runs on 3 hosts for high availability. However, at
>>> any given time, exactly one instance must be active. So, we are thinking
>> to
>>> use Leader election using Zookeeper.
>>> To this goal, on each service host we also start a ZK server, so we have
>> a
>>> 3-nodes ZK cluster and each service instance is a client to its dedicated
>>> ZK server.
>>> Then, we implement a leader election on top of Zookeeper using a basic
>>> recipe:
>>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
>>> 
>>> I have the following questions doubts regarding the approach:
>>> 
>>> 1. It seems like we can run into inconsistency issues when network
>>> partition occurs. Zookeeper documentation says that the inconsistency
>>> period may last “tens of seconds”. Am I understanding correctly that
>> during
>>> this time we may have 0 or 2 leaders?
>>> 2. Is it possible to reduce this inconsistency time (let's say to 3
>>> seconds) by tweaking tickTime and syncLimit parameters?
>>> 3. Is there a way to guarantee exactly one leader all the time? Should we
>>> implement a more complex leader election algorithm than the one suggested
>>> in the recipe (using ephemeral_sequential nodes)?
>>> 
>>> Thanks,
>>> Michael.
>> 
>> 



Re: Leader election

2018-12-06 Thread Michael Borokhovich
Thanks Jordan,

Yes, I will try Curator.
Also, beyond the problem described in the Tech Note, it seems like the
inconsistency may be caused by the partition of the Zookeeper cluster
itself. E.g., if a "leader" client is connected to the partitioned ZK node,
it may be notified not at the same time as the other clients connected to
the other ZK nodes. So, another client may take leadership while the
current leader still unaware of the change. Is it true?

Another follow up question. If Zookeeper can guarantee a single leader, is
it worth using it just for leader election? Maybe we can use a more
lightweight Hazelcast for example?

Michael.


On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman 
wrote:

> It is not possible to achieve the level of consistency you're after in an
> eventually consistent system such as ZooKeeper. There will always be an
> edge case where two ZooKeeper clients will believe they are leaders (though
> for a short period of time). In terms of how it affects Apache Curator, we
> have this Tech Note on the subject:
> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> description is true for any ZooKeeper client, not just Curator clients). If
> you do still intend to use a ZooKeeper lock/leader I suggest you try Apache
> Curator as writing these "recipes" is not trivial and have many gotchas
> that aren't obvious.
>
> -Jordan
>
> http://curator.apache.org 
>
>
> > On Dec 5, 2018, at 6:20 PM, Michael Borokhovich 
> wrote:
> >
> > Hello,
> >
> > We have a service that runs on 3 hosts for high availability. However, at
> > any given time, exactly one instance must be active. So, we are thinking
> to
> > use Leader election using Zookeeper.
> > To this goal, on each service host we also start a ZK server, so we have
> a
> > 3-nodes ZK cluster and each service instance is a client to its dedicated
> > ZK server.
> > Then, we implement a leader election on top of Zookeeper using a basic
> > recipe:
> > https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> >
> > I have the following questions doubts regarding the approach:
> >
> > 1. It seems like we can run into inconsistency issues when network
> > partition occurs. Zookeeper documentation says that the inconsistency
> > period may last “tens of seconds”. Am I understanding correctly that
> during
> > this time we may have 0 or 2 leaders?
> > 2. Is it possible to reduce this inconsistency time (let's say to 3
> > seconds) by tweaking tickTime and syncLimit parameters?
> > 3. Is there a way to guarantee exactly one leader all the time? Should we
> > implement a more complex leader election algorithm than the one suggested
> > in the recipe (using ephemeral_sequential nodes)?
> >
> > Thanks,
> > Michael.
>
>


Re: Leader election

2018-12-06 Thread Jordan Zimmerman
It is not possible to achieve the level of consistency you're after in an 
eventually consistent system such as ZooKeeper. There will always be an edge 
case where two ZooKeeper clients will believe they are leaders (though for a 
short period of time). In terms of how it affects Apache Curator, we have this 
Tech Note on the subject: 
https://cwiki.apache.org/confluence/display/CURATOR/TN10 
 (the description is 
true for any ZooKeeper client, not just Curator clients). If you do still 
intend to use a ZooKeeper lock/leader I suggest you try Apache Curator as 
writing these "recipes" is not trivial and have many gotchas that aren't 
obvious. 

-Jordan

http://curator.apache.org 


> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich  wrote:
> 
> Hello,
> 
> We have a service that runs on 3 hosts for high availability. However, at
> any given time, exactly one instance must be active. So, we are thinking to
> use Leader election using Zookeeper.
> To this goal, on each service host we also start a ZK server, so we have a
> 3-nodes ZK cluster and each service instance is a client to its dedicated
> ZK server.
> Then, we implement a leader election on top of Zookeeper using a basic
> recipe:
> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> 
> I have the following questions doubts regarding the approach:
> 
> 1. It seems like we can run into inconsistency issues when network
> partition occurs. Zookeeper documentation says that the inconsistency
> period may last “tens of seconds”. Am I understanding correctly that during
> this time we may have 0 or 2 leaders?
> 2. Is it possible to reduce this inconsistency time (let's say to 3
> seconds) by tweaking tickTime and syncLimit parameters?
> 3. Is there a way to guarantee exactly one leader all the time? Should we
> implement a more complex leader election algorithm than the one suggested
> in the recipe (using ephemeral_sequential nodes)?
> 
> Thanks,
> Michael.



回复:Re: Leader election

2018-12-06 Thread 毛蛤丝
---> "Can it happen that we end up with 2 leaders or 0 leader for some period of
time (for example, during network delays/partitions)?"
look at the code:
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderSelector.java#L340
it can guarantee exactly one leader all the time(EPHEMERAL_SEQUENTIAL zk-node) 
which has not too much correlations with the network partitions of zk ensembles 
itself.
I guess,haha!
- 原始邮件 -
发件人:Michael Borokhovich 
收件人:dev@zookeeper.apache.org, maoling199210...@sina.com
主题:Re: Leader election
日期:2018年12月06日 15点18分

Thanks, I will check it out.
However, do you know if it gives any better guarantees?
Can it happen that we end up with 2 leaders or 0 leader for some period of
time (for example, during network delays/partitions)?
On Wed, Dec 5, 2018 at 10:54 PM 毛蛤丝  wrote:
> suggest you use the ready-made implements of curator:
> http://curator.apache.org/curator-recipes/leader-election.html
> - 原始邮件 -
> 发件人:Michael Borokhovich 
> 收件人:"dev@zookeeper.apache.org" 
> 主题:Leader election
> 日期:2018年12月06日 07点29分
>
> Hello,
> We have a service that runs on 3 hosts for high availability. However, at
> any given time, exactly one instance must be active. So, we are thinking to
> use Leader election using Zookeeper.
> To this goal, on each service host we also start a ZK server, so we have a
> 3-nodes ZK cluster and each service instance is a client to its dedicated
> ZK server.
> Then, we implement a leader election on top of Zookeeper using a basic
> recipe:
> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> I have the following questions doubts regarding the approach:
> 1. It seems like we can run into inconsistency issues when network
> partition occurs. Zookeeper documentation says that the inconsistency
> period may last “tens of seconds”. Am I understanding correctly that during
> this time we may have 0 or 2 leaders?
> 2. Is it possible to reduce this inconsistency time (let's say to 3
> seconds) by tweaking tickTime and syncLimit parameters?
> 3. Is there a way to guarantee exactly one leader all the time? Should we
> implement a more complex leader election algorithm than the one suggested
> in the recipe (using ephemeral_sequential nodes)?
> Thanks,
> Michael.
>


Re: Leader election

2018-12-05 Thread Enrico Olivelli
Michael,
Leader election is not enough.
You must have some mechanism to fence off the partitioned leader.

If you are building a replicated state machine Apache Zookeeper + Apache
Bookkeeper can be a good choice
See this just an example:
https://github.com/ivankelly/bookkeeper-tutorial

This is the "bible" for ZooKeepers and it describes how to build such
systems and the importance of "fencing"
https://www.amazon.com/ZooKeeper-Distributed-Coordination-Flavio-Junqueira-ebook/dp/B00GRCODKS

If you are interested in BookKeeper ping on user@ Apache BookKeeper mailing
list

Enrico




Il gio 6 dic 2018, 08:18 Michael Borokhovich  ha
scritto:

> Thanks, I will check it out.
> However, do you know if it gives any better guarantees?
> Can it happen that we end up with 2 leaders or 0 leader for some period of
> time (for example, during network delays/partitions)?
>
>
>
> On Wed, Dec 5, 2018 at 10:54 PM 毛蛤丝  wrote:
>
> > suggest you use the ready-made implements of curator:
> > http://curator.apache.org/curator-recipes/leader-election.html
> > - 原始邮件 -
> > 发件人:Michael Borokhovich 
> > 收件人:"dev@zookeeper.apache.org" 
> > 主题:Leader election
> > 日期:2018年12月06日 07点29分
> >
> > Hello,
> > We have a service that runs on 3 hosts for high availability. However, at
> > any given time, exactly one instance must be active. So, we are thinking
> to
> > use Leader election using Zookeeper.
> > To this goal, on each service host we also start a ZK server, so we have
> a
> > 3-nodes ZK cluster and each service instance is a client to its dedicated
> > ZK server.
> > Then, we implement a leader election on top of Zookeeper using a basic
> > recipe:
> > https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> > I have the following questions doubts regarding the approach:
> > 1. It seems like we can run into inconsistency issues when network
> > partition occurs. Zookeeper documentation says that the inconsistency
> > period may last “tens of seconds”. Am I understanding correctly that
> during
> > this time we may have 0 or 2 leaders?
> > 2. Is it possible to reduce this inconsistency time (let's say to 3
> > seconds) by tweaking tickTime and syncLimit parameters?
> > 3. Is there a way to guarantee exactly one leader all the time? Should we
> > implement a more complex leader election algorithm than the one suggested
> > in the recipe (using ephemeral_sequential nodes)?
> > Thanks,
> > Michael.
> >
>


Re: Leader election

2018-12-05 Thread Michael Borokhovich
Thanks, I will check it out.
However, do you know if it gives any better guarantees?
Can it happen that we end up with 2 leaders or 0 leader for some period of
time (for example, during network delays/partitions)?



On Wed, Dec 5, 2018 at 10:54 PM 毛蛤丝  wrote:

> suggest you use the ready-made implements of curator:
> http://curator.apache.org/curator-recipes/leader-election.html
> - 原始邮件 -
> 发件人:Michael Borokhovich 
> 收件人:"dev@zookeeper.apache.org" 
> 主题:Leader election
> 日期:2018年12月06日 07点29分
>
> Hello,
> We have a service that runs on 3 hosts for high availability. However, at
> any given time, exactly one instance must be active. So, we are thinking to
> use Leader election using Zookeeper.
> To this goal, on each service host we also start a ZK server, so we have a
> 3-nodes ZK cluster and each service instance is a client to its dedicated
> ZK server.
> Then, we implement a leader election on top of Zookeeper using a basic
> recipe:
> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> I have the following questions doubts regarding the approach:
> 1. It seems like we can run into inconsistency issues when network
> partition occurs. Zookeeper documentation says that the inconsistency
> period may last “tens of seconds”. Am I understanding correctly that during
> this time we may have 0 or 2 leaders?
> 2. Is it possible to reduce this inconsistency time (let's say to 3
> seconds) by tweaking tickTime and syncLimit parameters?
> 3. Is there a way to guarantee exactly one leader all the time? Should we
> implement a more complex leader election algorithm than the one suggested
> in the recipe (using ephemeral_sequential nodes)?
> Thanks,
> Michael.
>