On (1), loading state from disk to find the last zxid: does this mean it loads the snapshot, or does it simply read the tail of the transaction log?

On Wed, Jul 17, 2013 at 6:43 AM, Flavio Junqueira <[email protected]> wrote:

> I need to also mention ZOOKEEPER-1549 in the context of point (2) below.
> That's a blocker for 3.5.0.
>
> -Flavio
>
> On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <[email protected]> wrote:
>
> > Moving the discussion to dev but keeping user on CC.
> >
> > Let's step back. The reason why we started the latest discussion in this
> > thread was because Kishore is concerned about recovery time. There are a
> > number of improvements we have been looking at for the next release, let
> > me go over my current understanding of the main points that add to the
> > recovery time:
> >
> > 1- Before we even start leader election, each server loads state from
> > disk to determine its last zxid. The last zxid is used in the election;
> > 2- Once the leader is elected, it loads state from disk and takes a
> > snapshot. Loading the database again is unnecessary (ZOOKEEPER-1642) and
> > the snapshot adds latency. In fact, it is not even correct to have it
> > there (ZOOKEEPER-1558).
> > 3- A follower takes a snapshot before acknowledging the NEWLEADER
> > message, so the leader has to wait until a quorum of followers finishes
> > their snapshot.
> >
> > The proposal I've heard here is to touch (1). For now, I'd rather keep
> > (1) as is and focus on fixing (2). We might be able to do something about
> > (3) and I'm actually not sure if there has been a discussion about it or
> > not.
> >
> > -Flavio
> >
> > On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <[email protected]> wrote:
> >
> >> Client will get session expire event only when a server explicitly tells
> >> the client. So any established sessions will remain in a disconnected
> >> state during the period
> >>
> >> So my comment about the need for longer session timeout might be
> >> incorrect. While the quorum is down during leader election, session
> >> won't expire during this period.
> >> When the quorum comes back, the client has to reconnect within session
> >> timeout in order to resume the session. However, client won't be able to
> >> issue any read/write request or create a new session while the quorum is
> >> down.
> >>
> >> However, some application may need a stronger consistency guarantee.
> >> They will have a special logic to abort the client if it was
> >> disconnected for an extended period. This is because the client won't be
> >> able to tell if the quorum is down or there is a network partition
> >> between the client and the quorum.
> >>
> >> --
> >> Thawan Kooburat
> >>
> >> On 7/16/13 6:46 PM, "kishore g" <[email protected]> wrote:
> >>
> >>> Thanks Thawan. Another question to follow up, so lets say client c1 is
> >>> connected to leader and leader fails. Now c1 is trying to connect to
> >>> another zk server but all servers are busy loading snapshot and can
> >>> take a minute or two. According to Flavio zk servers dont accept any
> >>> request while synchronization, but most clients dont keep that high
> >>> connection timeout. So does this mean clients will timeout on
> >>> connection? Is my understanding correct or zk servers will accept
> >>> connection requests but reject read/write requests.
> >>>
> >>> thanks,
> >>> Kishore G
> >>>
> >>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <[email protected]> wrote:
> >>>
> >>>> There is a plan to work on this optimization ZOOKEEPER-1674.
> >>>>
> >>>> --
> >>>> Thawan Kooburat
> >>>>
> >>>> On 7/16/13 1:37 PM, "kishore g" <[email protected]> wrote:
> >>>>
> >>>>> All servers in the quorum reading the snapshot from disk as part of
> >>>>> the synchronization phase. From Thawan's email it looks like when
> >>>>> ever there is a leader election, all zk servers read the snapshot
> >>>>> from disk.
> >>>>> I am not sure why all servers should reload the snapshot from disk as
> >>>>> this increases unavailability time.
> >>>>>
> >>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira <[email protected]> wrote:
> >>>>>
> >>>>>> The synchronization phase is part of the protocol and we use it to
> >>>>>> guarantee that we expose a consistent view of the state. During the
> >>>>>> synchronization phase, servers do not accept requests.
> >>>>>>
> >>>>>> Which behavior are you proposing we change, Kishore?
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks for clarification Flavio. Does this mean during the leader
> >>>>>>> election, both reads and writes are not supported? Do we start a
> >>>>>>> separate thread/jira of changing this behavior?
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> Kishore G
> >>>>>>>
> >>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> The disk state should be the authoritative state of a server, so
> >>>>>>>> if I remember correctly, we load the database as a way of
> >>>>>>>> validating the disk state. I don't claim that this is strictly
> >>>>>>>> necessary, but if we are to change it, then I would need to think
> >>>>>>>> this through.
> >>>>>>>>
> >>>>>>>> About leader election, if a leader loses support from a quorum of
> >>>>>>>> followers, then it will drop leadership. Any event that causes a
> >>>>>>>> follower to stop receiving messages from the leader or the
> >>>>>>>> follower to disconnect from the leader will make it stop
> >>>>>>>> supporting the current leader.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Sergey Maslyakov [mailto:[email protected]]
> >>>>>>>> Sent: 16 July 2013 16:16
> >>>>>>>> To: [email protected]
> >>>>>>>> Subject: Re: Maximum size of a snapshot
> >>>>>>>>
> >>>>>>>> And another extension on top of Kishore's question: do the
> >>>>>>>> reelections happen if the previously elected leader remains in the
> >>>>>>>> cluster? In other words, what events can trigger re-election and
> >>>>>>>> the corresponding temporary degradation of the service provided by
> >>>>>>>> Zookeeper?
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> /Sergey
> >>>>>>>>
> >>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Regarding #2. Is that really true that during leader election
> >>>>>>>>> every machine reloads snapshot data from disk? Any reason why
> >>>>>>>>> this is needed unless it really needs to truncate or undo
> >>>>>>>>> conflicting transactions already applied?
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Max snapshot size:
> >>>>>>>>>>
> >>>>>>>>>> Here is my take on these issue, others feel free to add or
> >>>>>>>>>> correct.
> >>>>>>>>>>
> >>>>>>>>>> 1. Depends on how much RAM your machine has. Snapshot should be
> >>>>>>>>>> less than the available RAM since everything is loaded into
> >>>>>>>>>> memory.
> >>>>>>>>>> 2. Depends on what is the availability guarantee that the client
> >>>>>>>>>> needs. If there is leader election, every machine need to reload
> >>>>>>>>>> the data from disk. So the quorum will be down for at least the
> >>>>>>>>>> same as snapshot loading time. The session timeout on the client
> >>>>>>>>>> side should be at least longer than expected downtime during
> >>>>>>>>>> leader election.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Thawan Kooburat
> >>>>>>>>>>
> >>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I have a couple of sizing questions to the users and
> >>>>>>>>>>> developers. Hope, you don't mind answering those.
> >>>>>>>>>>>
> >>>>>>>>>>> What is the guideline for the maximum reasonable size of a
> >>>>>>>>>>> DataTree that a single ZK server can manage? If ZK server
> >>>>>>>>>>> writes out a snapshot of about 1GB in size, is it pushed beyond
> >>>>>>>>>>> the limits or is it still manageable? If so, where is the
> >>>>>>>>>>> critical threshold when ZK is really being abused?
> >>>>>>>>>>>
> >>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
> >>>>>>>>>>> across an ensemble of three ZK servers?
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you,
> >>>>>>>>>>> /Sergey
