Thanks for following up, Alex.

On Fri, Apr 13, 2018, at 14:48, Alexander Shraer wrote:
> We discussed with Pat offline and agreed to go without this patch,
> especially since we need to patch 3 branches: 3.4, 3.5 and master.
> We'll prepare 3.5 and master and then commit all 3 together in time
> for the next release. So Abe, please go ahead with your release.
>
> Alex
>
> On Fri, Apr 13, 2018 at 2:26 PM, Patrick Hunt <ph...@apache.org> wrote:
>> Hey folks. I've been on vacation. My 0.02 - given the release
>> candidate is well underway, has sufficient votes/time to finalize,
>> this is not a regression in 3.4.12 and it's not yet committed, I would
>> think we finalize/push 3.4.12 then quickly follow up with a 3.4.13
>> that addresses this. Alex could be the RM given his interest/advocacy.
>>
>> Regards,
>>
>> Patrick
>>
>> On Fri, Apr 13, 2018 at 11:55 AM, Abraham Fine <af...@apache.org> wrote:
>> > Given that the primary driver of this release is to fix an issue
>> > with the misuse of dataDir and dataLogDir I would rather see this
>> > release make it out the door with minimal additional changes to
>> > core functionality so people can more confidently upgrade.
>> >
>> > What do you think Pat?
>> >
>> > Abe
>> >
>> > On Fri, Apr 13, 2018, at 11:37, Alexander Shraer wrote:
>> > > Now that we have the fix, why delay it to next release?
>> > >
>> > > On Fri, Apr 13, 2018 at 11:09 AM Abraham Fine <af...@apache.org> wrote:
>> > > > Let's wait until the next release to include this fix.
>> > > >
>> > > > On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
>> > > > > Hi,
>> > > > >
>> > > > > Please take a look at the new PR for ZK-2959:
>> > > > > https://github.com/apache/zookeeper/pull/500
>> > > > > If there are no further comments, I can commit it.
>> > > > >
>> > > > > Thanks,
>> > > > > Alex
>> > > > >
>> > > > > On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <shra...@gmail.com> wrote:
>> > > > > > Hi,
>> > > > > >
>> > > > > > The bug described in ZOOKEEPER-2959
>> > > > > > <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is
>> > > > > > that getEpochToPropose and waitForEpochAck do not
>> > > > > > distinguish between followers and observers.
>> > > > > > This can cause a candidate leader's acceptedEpoch to be
>> > > > > > updated with only support from observers. Same for
>> > > > > > waitForEpochAck - passing this method allows the candidate
>> > > > > > leader to update the currentEpoch. The latter helps this
>> > > > > > server to win FLE elections continuously, and the former
>> > > > > > (acceptedEpoch) causes anyone trying to connect to the
>> > > > > > server to think that it has more up-to-date data and
>> > > > > > truncate their logs to match.
>> > > > > >
>> > > > > > Alex
>> > > > > >
>> > > > > > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com> wrote:
>> > > > > >> Hi Alex,
>> > > > > >>
>> > > > > >> Can you give more details about the data loss scenario in
>> > > > > >> Jira ZOOKEEPER-2959
>> > > > > >> <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
>> > > > > >> As far as I know, the leader will ignore the observers'
>> > > > > >> ACK in waitForNewLeaderAck, so it will not start serving
>> > > > > >> traffic until it has received the actual quorum ACK. If it
>> > > > > >> doesn't have enough follower support before the timeout,
>> > > > > >> it will quit leading and its learners will re-sync with
>> > > > > >> the new leader.
>> > > > > >>
>> > > > > >> Thanks,
>> > > > > >> Fangmin
>> > > > > >>
>> > > > > >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com> wrote:
>> > > > > >>> Btw we actually observed the described issue (data
>> > > > > >>> loss), thankfully in a test environment. So I thought
>> > > > > >>> this is important to share with the community.
>> > > > > >>>
>> > > > > >>> Unfortunately I don’t have time to run a new ZK release
>> > > > > >>> for this, so I’m not going to -1 your candidate, but we
>> > > > > >>> are actively working on a fix (i.e. a test at this
>> > > > > >>> point) and I can commit that as soon as we have that.
>> > > > > >>>
>> > > > > >>> It may be worthwhile to delay the release by a few more
>> > > > > >>> days, but it’s totally up to you since you’re running it.
>> > > > > >>>
>> > > > > >>> Cheers
>> > > > > >>> Alex
>> > > > > >>>
>> > > > > >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com> wrote:
>> > > > > >>> > Got that. I still believe it's a completely valid
>> > > > > >>> > issue which has to be addressed, but it's not a
>> > > > > >>> > showstopper. I'm afraid we're not going to convince
>> > > > > >>> > each other, so it's probably Abe's call if he wants to
>> > > > > >>> > create another release candidate for the fix.
>> > > > > >>> >
>> > > > > >>> > I reviewed the code on github and I think it just
>> > > > > >>> > needs to be covered with a unit test to be complete.
>> > > > > >>> >
>> > > > > >>> > Regards,
>> > > > > >>> > Andor
>> > > > > >>> >
>> > > > > >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <shra...@gmail.com> wrote:
>> > > > > >>> > > Yes sort of, FLE is finished, then enough observers'
>> > > > > >>> > > messages reach the leader before participants'
>> > > > > >>> > > messages do.
>> > > > > >>> > > Whether it's rare depends on the number of observers
>> > > > > >>> > > and participants. For example with very few
>> > > > > >>> > > participants and many observers your chance of
>> > > > > >>> > > hitting this is quite high.
>> > > > > >>> > >
>> > > > > >>> > > Alex
>> > > > > >>> > >
>> > > > > >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <an...@cloudera.com> wrote:
>> > > > > >>> > > > Maybe I'm missing something here, but this looks
>> > > > > >>> > > > like a rare edge case to me. Participants must
>> > > > > >>> > > > finish the leader election successfully and right
>> > > > > >>> > > > after enough followers should fail to send epoch
>> > > > > >>> > > > to the leader, so observers can take it over.
>> > > > > >>> > > >
>> > > > > >>> > > > Is that description accurate?
>> > > > > >>> > > >
>> > > > > >>> > > > Andor
>> > > > > >>> > > >
>> > > > > >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <shra...@gmail.com> wrote:
>> > > > > >>> > > > > To clarify - in a deployment with observers this
>> > > > > >>> > > > > bug can potentially cause data loss. A server
>> > > > > >>> > > > > could be elected leader based just on the
>> > > > > >>> > > > > support of observers, even if this server's data
>> > > > > >>> > > > > is stale wrt other followers.
>> > > > > >>> > > > >
>> > > > > >>> > > > > It is certainly a blocker, just not sure if for
>> > > > > >>> > > > > 3.4.11 or 3.4.12.
>> > > > > >>> > > > >
>> > > > > >>> > > > > Alex
>> > > > > >>> > > > >
>> > > > > >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <an...@cloudera.com> wrote:
>> > > > > >>> > > > > > I don't think it's a blocker.
>> > > > > >>> > > > > > The jira and PR have been open since last
>> > > > > >>> > > > > > December and 3.4.11 was released without it.
>> > > > > >>> > > > > >
>> > > > > >>> > > > > > Although this bug is also important to fix, I
>> > > > > >>> > > > > > believe it's more important to release a fix
>> > > > > >>> > > > > > for the regression we've found in 3.4.11 asap.
>> > > > > >>> > > > > >
>> > > > > >>> > > > > > Abe, any thoughts?
>> > > > > >>> > > > > >
>> > > > > >>> > > > > > Regards,
>> > > > > >>> > > > > > Andor
>> > > > > >>> > > > > >
>> > > > > >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <shra...@gmail.com> wrote:
>> > > > > >>> > > > > > > Sorry for coming in at the last moment. I'm
>> > > > > >>> > > > > > > not sure when the next 3.4 release is
>> > > > > >>> > > > > > > scheduled, so just wanted to mention this
>> > > > > >>> > > > > > > bug, which I believe is a blocker for either
>> > > > > >>> > > > > > > this or next release:
>> > > > > >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
>> > > > > >>> > > > > > >
>> > > > > >>> > > > > > > Best,
>> > > > > >>> > > > > > > Alex
>> > > > > >>> > > > > > >
>> > > > > >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > > > >>> > > > > > > > Can the vote be closed ?
>> > > > > >>> > > > > > > >
>> > > > > >>> > > > > > > > It seems we have enough +1's
>> > > > > >>> > > > > > > >
>> > > > > >>> > > > > > > > Thanks
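
For anyone skimming the thread, here is a rough sketch of the ZOOKEEPER-2959 problem as Alex describes it: a quorum check that counts every server that responded (observers included) versus one that counts only voting members. This is not ZooKeeper's actual code. The class and method names (EpochQuorumSketch, naiveQuorum, votingQuorum) and the server ids are made up for illustration, and the real logic in getEpochToPropose/waitForEpochAck is more involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; names and ids are assumptions, not ZooKeeper APIs.
public class EpochQuorumSketch {

    // Counts every responder, so observers can push the count past a majority.
    public static boolean naiveQuorum(Set<Long> votingMembers, Set<Long> responders) {
        return responders.size() > votingMembers.size() / 2;
    }

    // Counts only responders that are voting members (followers or the leader itself).
    public static boolean votingQuorum(Set<Long> votingMembers, Set<Long> responders) {
        Set<Long> votingResponders = new HashSet<>(responders);
        votingResponders.retainAll(votingMembers);
        return votingResponders.size() > votingMembers.size() / 2;
    }

    public static void main(String[] args) {
        // Three voting members (ids 1-3); ids 4 and 5 are observers.
        Set<Long> voters = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        // The candidate leader (1) has heard only from itself and two observers.
        Set<Long> responders = new HashSet<>(Arrays.asList(1L, 4L, 5L));

        System.out.println("naive quorum:  " + naiveQuorum(voters, responders));  // true
        System.out.println("voting quorum: " + votingQuorum(voters, responders)); // false
    }
}
```

In this toy setup the naive check passes with support from observers alone, which matches the scenario discussed in the thread (few participants, many observers); filtering responders against the voting set before comparing to the majority threshold avoids that.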