I need to also mention ZOOKEEPER-1549 in the context of point (2) below. That's 
a blocker for 3.5.0. 

-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <[email protected]> wrote:

> Moving the discussion to dev but keeping user on CC.
> 
> Let's step back. We started the latest discussion in this thread because 
> Kishore is concerned about recovery time. There are a number of 
> improvements we have been looking at for the next release. Let me go over 
> my current understanding of the main points that add to the recovery 
> time:
> 
> 1- Before we even start leader election, each server loads state from disk to 
> determine its last zxid. The last zxid is used in the election.
> 2- Once the leader is elected, it loads state from disk and takes a snapshot. 
> Loading the database again is unnecessary (ZOOKEEPER-1642) and the snapshot 
> adds latency. In fact, it is not even correct to have it there 
> (ZOOKEEPER-1558).
> 3- A follower takes a snapshot before acknowledging the NEWLEADER message, so 
> the leader has to wait until a quorum of followers finishes their snapshot.
> 
> The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as 
> is and focus on fixing (2). We might be able to do something about (3) and 
> I'm actually not sure if there has been a discussion about it or not.
> 
> -Flavio
> 
> On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <[email protected]> wrote:
> 
>> A client gets a session-expired event only when a server explicitly tells
>> it so. Any established sessions will therefore remain in a disconnected
>> state during that period.
>> 
>> So my comment about the need for a longer session timeout might be
>> incorrect. While the quorum is down during leader election, sessions won't
>> expire. When the quorum comes back, the client has to reconnect within the
>> session timeout in order to resume the session. However, the client won't
>> be able to issue any read/write requests or create a new session while the
>> quorum is down.
>> 
>> However, some applications may need a stronger consistency guarantee. They
>> will have special logic to abort the client if it has been disconnected
>> for an extended period. This is because the client can't tell whether
>> the quorum is down or there is a network partition between the client and
>> the quorum.
>> 
>> 
>> -- 
>> Thawan Kooburat
>> 
>> 
>> 
>> 
>> 
>> On 7/16/13 6:46 PM, "kishore g" <[email protected]> wrote:
>> 
>>> Thanks Thawan. Another question to follow up: let's say client c1 is
>>> connected to the leader and the leader fails. Now c1 is trying to connect
>>> to another zk server, but all servers are busy loading the snapshot, which
>>> can take a minute or two. According to Flavio, zk servers don't accept any
>>> requests during synchronization, but most clients don't keep that high a
>>> connection timeout. So does this mean clients will time out on connection?
>>> Is my understanding correct, or will zk servers accept connection requests
>>> but reject read/write requests?
>>> 
>>> thanks,
>>> Kishore G
>>> 
>>> 
>>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <[email protected]> wrote:
>>> 
>>>> There is a plan to work on this optimization: ZOOKEEPER-1674.
>>>> 
>>>> 
>>>> --
>>>> Thawan Kooburat
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 7/16/13 1:37 PM, "kishore g" <[email protected]> wrote:
>>>> 
>>>>> All servers in the quorum reading the snapshot from disk as part of the
>>>>> synchronization phase. From Thawan's email, it looks like whenever there
>>>>> is a leader election, all zk servers read the snapshot from disk. I am not
>>>>> sure why all servers should reload the snapshot from disk, as this
>>>>> increases unavailability time.
>>>>> 
>>>>> 
>>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira
>>>>> <[email protected]> wrote:
>>>>> 
>>>>>> The synchronization phase is part of the protocol and we use it to
>>>>>> guarantee that we expose a consistent view of the state. During the
>>>>>> synchronization phase, servers do not accept requests.
>>>>>> 
>>>>>> Which behavior are you proposing we change, Kishore?
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <[email protected]> wrote:
>>>>>> 
>>>>>>> Thanks for the clarification, Flavio. Does this mean that during leader
>>>>>>> election, both reads and writes are not supported? Do we start a
>>>>>>> separate thread/jira for changing this behavior?
>>>>>>> 
>>>>>>> thanks,
>>>>>>> Kishore G
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> The disk state should be the authoritative state of a server, so if I
>>>>>>>> remember correctly, we load the database as a way of validating the
>>>>>>>> disk state. I don't claim that this is strictly necessary, but if we
>>>>>>>> are to change it, then I would need to think this through.
>>>>>>>> 
>>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>>> followers, then it will drop leadership. Any event that causes a
>>>>>>>> follower to stop receiving messages from the leader or the follower to
>>>>>>>> disconnect from the leader will make it stop supporting the current
>>>>>>>> leader.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Maslyakov [mailto:[email protected]]
>>>>>>>> Sent: 16 July 2013 16:16
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>>> 
>>>>>>>> And another extension on top of Kishore's question: do the reelections
>>>>>>>> happen if the previously elected leader remains in the cluster? In
>>>>>>>> other words, what events can trigger re-election and the corresponding
>>>>>>>> temporary degradation of the service provided by ZooKeeper?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> /Sergey
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Regarding #2: is it really true that during leader election every
>>>>>>>>> machine reloads snapshot data from disk? Is there any reason why this
>>>>>>>>> is needed unless it really needs to truncate or undo conflicting
>>>>>>>>> transactions already applied?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Max snapshot size:
>>>>>>>>>> 
>>>>>>>>>> Here is my take on these issues; others, feel free to add or correct.
>>>>>>>>>> 
>>>>>>>>>> 1. Depends on how much RAM your machine has. The snapshot should be
>>>>>>>>>> smaller than the available RAM since everything is loaded into memory.
>>>>>>>>>> 2. Depends on the availability guarantee that the client needs. If
>>>>>>>>>> there is a leader election, every machine needs to reload the data
>>>>>>>>>> from disk, so the quorum will be down for at least as long as the
>>>>>>>>>> snapshot loading time. The session timeout on the client side should
>>>>>>>>>> be longer than the expected downtime during leader election.
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Thawan Kooburat
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I have a couple of sizing questions for the users and developers. I
>>>>>>>>>>> hope you don't mind answering them.
>>>>>>>>>>> 
>>>>>>>>>>> What is the guideline for the maximum reasonable size of a DataTree
>>>>>>>>>>> that a single ZK server can manage? If a ZK server writes out a
>>>>>>>>>>> snapshot of about 1GB in size, is it pushed beyond the limits or is
>>>>>>>>>>> it still manageable? If so, where is the critical threshold at which
>>>>>>>>>>> ZK is really being abused?
>>>>>>>>>>> 
>>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
>>>>>>>>>>> across an ensemble of three ZK servers?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> /Sergey
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
> 
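
Thawan's point in the quoted thread, that some applications abort a client
once it has been disconnected for an extended period, could be sketched
roughly as below. This is only an illustration and is not ZooKeeper client
code: the class DisconnectWatchdog, its method names, and the
max_disconnect_secs parameter are all hypothetical, and wiring the two
callbacks to real connection-state events is left to the application.

```python
import time
from typing import Callable, Optional


class DisconnectWatchdog:
    """Hypothetical helper: tracks how long a client has been disconnected."""

    def __init__(self, max_disconnect_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # max_disconnect_secs is an application-chosen bound, not a ZooKeeper setting.
        self.max_disconnect_secs = max_disconnect_secs
        self._clock = clock
        self._disconnected_at: Optional[float] = None

    def on_disconnected(self) -> None:
        # Keep the first timestamp if several disconnect events arrive in a row.
        if self._disconnected_at is None:
            self._disconnected_at = self._clock()

    def on_connected(self) -> None:
        # Reconnecting clears the timer.
        self._disconnected_at = None

    def should_abort(self) -> bool:
        # The client cannot tell a quorum outage from a network partition,
        # so the only signal used here is elapsed time since the disconnect.
        if self._disconnected_at is None:
            return False
        elapsed = self._clock() - self._disconnected_at
        return elapsed >= self.max_disconnect_secs
```

An application would call on_disconnected() and on_connected() from its own
connection-state handler and poll should_abort() before acting on state it
read earlier. The clock is injectable only so the logic can be exercised
without waiting in real time.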
