Hey folks. I've been on vacation. My 0.02: the release candidate is well underway and has sufficient votes/time to finalize, this is not a regression in 3.4.12, and the fix is not yet committed. Given that, I think we should finalize/push 3.4.12, then quickly follow up with a 3.4.13 that addresses this. Alex could be the RM given his interest/advocacy.

Regards,

Patrick

On Fri, Apr 13, 2018 at 11:55 AM, Abraham Fine <af...@apache.org> wrote:

> Given that the primary driver of this release is to fix an issue with the
> misuse of dataDir and dataLogDir I would rather see this release make it
> out the door with minimal additional changes to core functionality so
> people can more confidently upgrade.
>
> What do you think Pat?
>
> Abe
>
> On Fri, Apr 13, 2018, at 11:37, Alexander Shraer wrote:
> > Now that we have the fix, why delay it to next release?
> >
> > On Fri, Apr 13, 2018 at 11:09 AM Abraham Fine <af...@apache.org> wrote:
> >
> > > Let's wait until the next release to include this fix.
> > >
> > > On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
> > > > Hi,
> > > >
> > > > Please take a look at the new PR for ZK-2959:
> > > > https://github.com/apache/zookeeper/pull/500
> > > > If there are no further comments, I can commit it.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > > > On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <shra...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The bug described in ZOOKEEPER-2959
> > > > > <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that
> > > > > getEpochToPropose and waitForEpochAck do not distinguish between followers
> > > > > and observers.
> > > > > This can cause a candidate leader's acceptedEpoch to be updated with only
> > > > > support from observers. Same for waitForEpochAck - passing this method
> > > > > allows the candidate leader to update the currentEpoch. The latter helps
> > > > > this server to win FLE elections continuously, and the former
> > > > > (acceptedEpoch)
> > > > > causes anyone trying to connect to the server to think that it has more
> > > > > up-to-date data and truncate their logs to match.
> > > > >
> > > > > Alex
> > > > >
> > > > > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com> wrote:
> > > > >
> > > > >> Hi Alex,
> > > > >>
> > > > >> Can you give more details about the data loss scenario in Jira
> > > > >> ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
> > > > >> As far as I know, the leader will ignore the observers' ACK in
> > > > >> waitForNewLeaderAck, so it will not start serving traffic until it has
> > > > >> received the actual quorum ACK. If it doesn't have enough followers'
> > > > >> support before the timeout, it will quit leading and its learners will
> > > > >> re-sync with the new leader.
> > > > >>
> > > > >> Thanks,
> > > > >> Fangmin
> > > > >>
> > > > >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Btw we actually observed the described issue (data loss), thankfully in a
> > > > >>> test environment. So I thought this is important to share with the
> > > > >>> community.
> > > > >>>
> > > > >>> Unfortunately I don’t have time to run a new ZK release for this, so I’m
> > > > >>> not going to -1 your candidate, but we are actively working on a fix (i.e.
> > > > >>> a test at this point) and I can commit that as soon as we have that.
> > > > >>>
> > > > >>> It may be worthwhile to delay the release by a few more days, but it’s
> > > > >>> totally up to you since you’re running it.
> > > > >>>
> > > > >>> Cheers
> > > > >>> Alex
> > > > >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com> wrote:
> > > > >>>
> > > > >>> > Got that. I still believe it's a completely valid issue which has to be
> > > > >>> > addressed, but it's not a showstopper.
> > > > >>> > I'm afraid we're not going to
> > > > >>> > convince each other, so it's probably Abe's call if he wants to create
> > > > >>> > another release candidate for the fix.
> > > > >>> >
> > > > >>> > I reviewed the code on github and I think it just needs to be covered with
> > > > >>> > a unit test to be complete.
> > > > >>> >
> > > > >>> > Regards,
> > > > >>> > Andor
> > > > >>> >
> > > > >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <shra...@gmail.com>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > Yes sort of, FLE is finished, then enough observers' messages reach the
> > > > >>> > > leader before participants' messages do.
> > > > >>> > > Whether it's rare depends on the number of observers and participants. For
> > > > >>> > > example with very few participants and many observers
> > > > >>> > > your chance of hitting this is quite high.
> > > > >>> > >
> > > > >>> > > Alex
> > > > >>> > >
> > > > >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <an...@cloudera.com> wrote:
> > > > >>> > >
> > > > >>> > > > Maybe I'm missing something here, but this looks like a rare edge case to
> > > > >>> > > > me. Participants must finish the leader election successfully, and right
> > > > >>> > > > after that enough followers must fail to send their epoch to the
> > > > >>> > > > leader, so observers can take over.
> > > > >>> > > >
> > > > >>> > > > Is that description accurate?
> > > > >>> > > >
> > > > >>> > > > Andor
> > > > >>> > > >
> > > > >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <shra...@gmail.com>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > > > To clarify - in a deployment with observers this bug can potentially cause
> > > > >>> > > > > data loss.
> > > > >>> > > > > A server could be elected leader based just on the support of
> > > > >>> > > > > observers, even if this server's data is stale wrt other followers.
> > > > >>> > > > >
> > > > >>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or 3.4.12.
> > > > >>> > > > >
> > > > >>> > > > > Alex
> > > > >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <an...@cloudera.com> wrote:
> > > > >>> > > > >
> > > > >>> > > > > > I don't think it's a blocker.
> > > > >>> > > > > > The jira and PR have been open since last December and 3.4.11 was
> > > > >>> > > > > > released without it.
> > > > >>> > > > > >
> > > > >>> > > > > > Although this bug is also important to fix, I believe it's more important
> > > > >>> > > > > > to release a fix for the regression we've found in 3.4.11 asap.
> > > > >>> > > > > >
> > > > >>> > > > > > Abe, any thoughts?
> > > > >>> > > > > >
> > > > >>> > > > > > Regards,
> > > > >>> > > > > > Andor
> > > > >>> > > > > >
> > > > >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <shra...@gmail.com>
> > > > >>> > > > > > wrote:
> > > > >>> > > > > >
> > > > >>> > > > > > > Sorry for coming in at the last moment.
> > > > >>> > > > > > > I'm not sure when the next 3.4
> > > > >>> > > > > > > release is scheduled, so just wanted to mention this bug,
> > > > >>> > > > > > > which I believe is a blocker for either this or next release:
> > > > >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
> > > > >>> > > > > > >
> > > > >>> > > > > > > Best,
> > > > >>> > > > > > > Alex
> > > > >>> > > > > > >
> > > > >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > >>> > > > > > >
> > > > >>> > > > > > > > Can the vote be closed?
> > > > >>> > > > > > > >
> > > > >>> > > > > > > > It seems we have enough +1's
> > > > >>> > > > > > > >
> > > > >>> > > > > > > > Thanks
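
For anyone reading this thread later, the counting problem described for ZOOKEEPER-2959 (epoch support being counted without distinguishing followers from observers) can be illustrated with a small Java sketch. This is not the ZooKeeper code or the fix in the PR; the class name, method names, and server ids below are hypothetical and chosen only to show why a quorum check should count voting participants rather than every connected learner.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from ZooKeeper.
final class EpochQuorumSketch {

    // Problematic check: every connected server id counts toward the
    // majority, so observers can push the count over the threshold.
    static boolean countsEveryLearner(Set<Long> votingMembers, Set<Long> connected) {
        return connected.size() > votingMembers.size() / 2;
    }

    // Stricter check: only ids that belong to the voting membership count.
    static boolean countsOnlyParticipants(Set<Long> votingMembers, Set<Long> connected) {
        int votes = 0;
        for (Long sid : connected) {
            if (votingMembers.contains(sid)) {
                votes++;
            }
        }
        return votes > votingMembers.size() / 2;
    }

    public static void main(String[] args) {
        // Assumed ensemble: servers 1-3 vote, servers 4 and 5 are observers.
        Set<Long> voters = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        // The candidate leader (1) has heard only from the two observers so far.
        Set<Long> connected = new HashSet<>(Arrays.asList(1L, 4L, 5L));

        System.out.println(countsEveryLearner(voters, connected));     // true
        System.out.println(countsOnlyParticipants(voters, connected)); // false
    }
}
```

In this assumed setup the first check would let the candidate move its epoch forward with support from observers alone, which is the situation the thread describes; the second check waits until a majority of voting members has connected.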