I need to also mention ZOOKEEPER-1549 in the context of point (2) below. That's a blocker for 3.5.0.
-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <[email protected]> wrote:

> Moving the discussion to dev but keeping user on CC.
>
> Let's step back. The reason why we started the latest discussion in this
> thread was because Kishore is concerned about recovery time. There are a
> number of improvements we have been looking at for the next release; let me
> go over my current understanding of the main points that add to the recovery
> time:
>
> 1- Before we even start leader election, each server loads state from disk to
> determine its last zxid. The last zxid is used in the election;
> 2- Once the leader is elected, it loads state from disk and takes a snapshot.
> Loading the database again is unnecessary (ZOOKEEPER-1642) and the snapshot
> adds latency. In fact, it is not even correct to have it there
> (ZOOKEEPER-1558).
> 3- A follower takes a snapshot before acknowledging the NEWLEADER message, so
> the leader has to wait until a quorum of followers finishes their snapshot.
>
> The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as
> is and focus on fixing (2). We might be able to do something about (3) and
> I'm actually not sure if there has been a discussion about it or not.
>
> -Flavio
>
> On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <[email protected]> wrote:
>
>> Client will get session expire event only when a server explicitly tells
>> the client. So any established sessions will remain in a disconnected
>> state during the period.
>>
>> So my comment about the need for longer session timeout might be
>> incorrect. While the quorum is down during leader election, session won't
>> expire during this period. When the quorum comes back, the client has to
>> reconnect within session timeout in order to resume the session. However,
>> client won't be able to issue any read/write request or create a new
>> session while the quorum is down.
>>
>> However, some applications may need a stronger consistency guarantee. They
>> will have a special logic to abort the client if it was disconnected for
>> an extended period. This is because the client won't be able to tell if
>> the quorum is down or there is a network partition between the client and
>> the quorum.
>>
>> --
>> Thawan Kooburat
>>
>> On 7/16/13 6:46 PM, "kishore g" <[email protected]> wrote:
>>
>>> Thanks Thawan. Another question to follow up, so let's say client c1 is
>>> connected to leader and leader fails. Now c1 is trying to connect to
>>> another zk server but all servers are busy loading snapshot and can take a
>>> minute or two. According to Flavio zk servers don't accept any request
>>> during synchronization, but most clients don't keep that high a connection
>>> timeout. So does this mean clients will time out on connection? Is my
>>> understanding correct, or will zk servers accept connection requests but
>>> reject read/write requests?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <[email protected]> wrote:
>>>
>>>> There is a plan to work on this optimization ZOOKEEPER-1674.
>>>>
>>>> --
>>>> Thawan Kooburat
>>>>
>>>> On 7/16/13 1:37 PM, "kishore g" <[email protected]> wrote:
>>>>
>>>>> All servers in the quorum reading the snapshot from disk as part of the
>>>>> synchronization phase. From Thawan's email it looks like whenever there
>>>>> is a leader election, all zk servers read the snapshot from disk. I am
>>>>> not sure why all servers should reload the snapshot from disk as this
>>>>> increases unavailability time.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira <[email protected]> wrote:
>>>>>
>>>>>> The synchronization phase is part of the protocol and we use it to
>>>>>> guarantee that we expose a consistent view of the state. During the
>>>>>> synchronization phase, servers do not accept requests.
>>>>>>
>>>>>> Which behavior are you proposing we change, Kishore?
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Flavio. Does this mean that during leader
>>>>>>> election, both reads and writes are not supported? Do we start a
>>>>>>> separate thread/jira for changing this behavior?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Kishore G
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>
>>>>>>>> The disk state should be the authoritative state of a server, so if I
>>>>>>>> remember correctly, we load the database as a way of validating the
>>>>>>>> disk state. I don't claim that this is strictly necessary, but if we
>>>>>>>> are to change it, then I would need to think this through.
>>>>>>>>
>>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>>> followers, then it will drop leadership. Any event that causes a
>>>>>>>> follower to stop receiving messages from the leader or the follower to
>>>>>>>> disconnect from the leader will make it stop supporting the current
>>>>>>>> leader.
>>>>>>>>
>>>>>>>> -Flavio
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Maslyakov [mailto:[email protected]]
>>>>>>>> Sent: 16 July 2013 16:16
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>>>
>>>>>>>> And another extension on top of Kishore's question: do the reelections
>>>>>>>> happen if the previously elected leader remains in the cluster? In
>>>>>>>> other words, what events can trigger re-election and the corresponding
>>>>>>>> temporary degradation of the service provided by Zookeeper?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> /Sergey
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Regarding #2.
>>>>>>>>> Is that really true that during leader election every machine
>>>>>>>>> reloads snapshot data from disk? Any reason why this is needed
>>>>>>>>> unless it really needs to truncate or undo conflicting transactions
>>>>>>>>> already applied?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Max snapshot size:
>>>>>>>>>>
>>>>>>>>>> Here is my take on these issues, others feel free to add or correct.
>>>>>>>>>>
>>>>>>>>>> 1. Depends on how much RAM your machine has. Snapshot should be
>>>>>>>>>> less than the available RAM since everything is loaded into memory.
>>>>>>>>>> 2. Depends on what is the availability guarantee that the client
>>>>>>>>>> needs. If there is leader election, every machine needs to reload
>>>>>>>>>> the data from disk. So the quorum will be down for at least as long
>>>>>>>>>> as the snapshot loading time. The session timeout on the client
>>>>>>>>>> side should be longer than the expected downtime during leader
>>>>>>>>>> election.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thawan Kooburat
>>>>>>>>>>
>>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have a couple of sizing questions for the users and developers.
>>>>>>>>>>> Hope you don't mind answering those.
>>>>>>>>>>>
>>>>>>>>>>> What is the guideline for the maximum reasonable size of a DataTree
>>>>>>>>>>> that a single ZK server can manage? If ZK server writes out a
>>>>>>>>>>> snapshot of about 1GB in size, is it pushed beyond the limits or is
>>>>>>>>>>> it still manageable? If so, where is the critical threshold when ZK
>>>>>>>>>>> is really being abused?
>>>>>>>>>>>
>>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
>>>>>>>>>>> across an ensemble of three ZK servers?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> /Sergey
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
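
For reference, the client-side safeguard Thawan describes in this thread (abort the client if it stays disconnected for an extended period, since it cannot tell a quorum outage from a network partition) could look roughly like the sketch below. This is only an illustration under stated assumptions: `DisconnectWatchdog`, its method names, and the `max_disconnected_s` threshold are hypothetical and are not part of the ZooKeeper client API. The application would have to call the two `on_*` methods from whatever connection-state callback its client library provides.

```python
import time


class DisconnectWatchdog:
    """Hypothetical helper: tracks how long a client has been disconnected."""

    def __init__(self, max_disconnected_s, clock=time.monotonic):
        # max_disconnected_s is an application-chosen limit (an assumption),
        # independent of the ZooKeeper session timeout.
        self.max_disconnected_s = max_disconnected_s
        self._clock = clock
        self._disconnected_since = None  # None while connected

    def on_disconnected(self):
        # Keep the time of the first disconnect if already disconnected.
        if self._disconnected_since is None:
            self._disconnected_since = self._clock()

    def on_connected(self):
        # Reconnecting clears the disconnect timer.
        self._disconnected_since = None

    def should_abort(self):
        # True once the client has been disconnected for at least the limit.
        if self._disconnected_since is None:
            return False
        elapsed = self._clock() - self._disconnected_since
        return elapsed >= self.max_disconnected_s
```

The application would poll `should_abort()` periodically and stop acting on possibly stale state once it returns true. The clock is injectable only so the behavior can be checked without waiting in real time.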
