Now that we have the fix, why delay it to the next release?

On Fri, Apr 13, 2018 at 11:09 AM Abraham Fine <af...@apache.org> wrote:
> Let's wait until the next release to include this fix.
>
> On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
> > Hi,
> >
> > Please take a look at the new PR for ZK-2959:
> > https://github.com/apache/zookeeper/pull/500
> > If there are no further comments, I can commit it.
> >
> > Thanks,
> > Alex
> >
> > On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <shra...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > The bug described in ZOOKEEPER-2959
> > > <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that
> > > getEpochToPropose and waitForEpochAck do not distinguish between
> > > followers and observers.
> > > This can cause a candidate leader's acceptedEpoch to be updated with
> > > support only from observers. Same for waitForEpochAck - passing this
> > > method allows the candidate leader to update the currentEpoch. The
> > > latter helps this server to win FLE elections continuously, and the
> > > former (acceptedEpoch) causes anyone trying to connect to the server
> > > to think that it has more up-to-date data and truncate their logs to
> > > match.
> > >
> > >
> > > Alex
> > >
> > > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com> wrote:
> > >
> > >> Hi Alex,
> > >>
> > >> Can you give more details about the data loss scenario in Jira
> > >> ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
> > >> As far as I know, the leader will ignore the observers' ACK in
> > >> waitForNewLeaderAck, so it will not start serving traffic until it has
> > >> received the actual quorum ACK. If it doesn't have enough follower
> > >> support before the timeout, it will quit leading and its learners will
> > >> re-sync with the new leader.
> > >>
> > >> Thanks,
> > >> Fangmin
> > >>
> > >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > >>
> > >>> Btw we actually observed the described issue (data loss), thankfully in a
> > >>> test environment.
> > >>> So I thought this was important to share with the
> > >>> community.
> > >>>
> > >>> Unfortunately I don’t have time to run a new ZK release for this, so I’m
> > >>> not going to -1 your candidate, but we are actively working on a fix (i.e.
> > >>> a test at this point) and I can commit that as soon as we have it.
> > >>>
> > >>> It may be worthwhile to delay the release by a few more days, but it’s
> > >>> totally up to you since you’re running it.
> > >>>
> > >>> Cheers
> > >>> Alex
> > >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com> wrote:
> > >>>
> > >>> > Got that. I still believe it's a completely valid issue which has to be
> > >>> > addressed, but it's not a showstopper. I'm afraid we're not going to
> > >>> > convince each other, so it's probably Abe's call if he wants to create
> > >>> > another release candidate for the fix.
> > >>> >
> > >>> > I reviewed the code on github and I think it just needs to be covered
> > >>> > with a unit test to be complete.
> > >>> >
> > >>> > Regards,
> > >>> > Andor
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > >>> >
> > >>> > > Yes, sort of: FLE is finished, then enough observers' messages reach
> > >>> > > the leader before the participants' messages do.
> > >>> > > Whether it's rare depends on the number of observers and
> > >>> > > participants. For example, with very few participants and many
> > >>> > > observers your chances of hitting this are quite high.
> > >>> > >
> > >>> > > Alex
> > >>> > >
> > >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <an...@cloudera.com> wrote:
> > >>> > >
> > >>> > > > Maybe I'm missing something here, but this looks like a rare edge
> > >>> > > > case to me.
> > >>> > > > Participants must finish the leader election successfully, and
> > >>> > > > right after that enough followers must fail to send their epoch to
> > >>> > > > the leader, so that observers can take over.
> > >>> > > >
> > >>> > > > Is that description accurate?
> > >>> > > >
> > >>> > > > Andor
> > >>> > > >
> > >>> > > >
> > >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > >>> > > >
> > >>> > > > > To clarify - in a deployment with observers this bug can
> > >>> > > > > potentially cause data loss. A server could be elected leader
> > >>> > > > > based just on the support of observers, even if this server's
> > >>> > > > > data is stale wrt other followers.
> > >>> > > > >
> > >>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or 3.4.12.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Alex
> > >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <an...@cloudera.com> wrote:
> > >>> > > > >
> > >>> > > > > > I don't think it's a blocker.
> > >>> > > > > > The jira and PR have been open since last December and 3.4.11
> > >>> > > > > > was released without it.
> > >>> > > > > >
> > >>> > > > > > Although this bug is also important to fix, I believe it's more
> > >>> > > > > > important to release a fix for the regression we've found in
> > >>> > > > > > 3.4.11 asap.
> > >>> > > > > >
> > >>> > > > > > Abe, any thoughts?
> > >>> > > > > >
> > >>> > > > > > Regards,
> > >>> > > > > > Andor
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <shra...@gmail.com> wrote:
> > >>> > > > > >
> > >>> > > > > > > Sorry for coming in at the last moment.
> > >>> > > > > > > I'm not sure when the next 3.4
> > >>> > > > > > > release is scheduled, so just wanted to mention this bug,
> > >>> > > > > > > which I believe is a blocker for either this or next release:
> > >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
> > >>> > > > > > >
> > >>> > > > > > > Best,
> > >>> > > > > > > Alex
> > >>> > > > > > >
> > >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>> > > > > > >
> > >>> > > > > > > > Can the vote be closed ?
> > >>> > > > > > > >
> > >>> > > > > > > > It seems we have enough +1's
> > >>> > > > > > > >
> > >>> > > > > > > > Thanks
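
For readers following the ZOOKEEPER-2959 discussion above, here is a minimal Java sketch of the idea that only voting participants should count toward the epoch quorum. It is an illustration under stated assumptions, not the ZooKeeper code and not the change in the PR: the class EpochQuorumSketch, its methods, and the server ids are all hypothetical, and a simple majority quorum is assumed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is not ZooKeeper's Leader class.
final class EpochQuorumSketch {
    // Ids of voting participants (the candidate leader and the followers).
    private final Set<Long> votingMembers;
    // Voting participants whose epoch message has been seen so far.
    private final Set<Long> connected = new HashSet<>();

    EpochQuorumSketch(Set<Long> votingMembers) {
        this.votingMembers = new HashSet<>(votingMembers);
    }

    // Record an epoch message from server `sid`.
    // Messages from non-voting servers (observers) are ignored.
    void recordEpochMessage(long sid) {
        if (votingMembers.contains(sid)) {
            connected.add(sid);
        }
    }

    // Assumes a simple majority quorum over voting members only.
    boolean hasQuorum() {
        return connected.size() > votingMembers.size() / 2;
    }

    public static void main(String[] args) {
        // Servers 1-3 are voting participants; 4-6 are observers.
        EpochQuorumSketch quorum =
                new EpochQuorumSketch(new HashSet<>(Arrays.asList(1L, 2L, 3L)));
        quorum.recordEpochMessage(1L); // the candidate leader itself
        quorum.recordEpochMessage(4L); // observers arrive first
        quorum.recordEpochMessage(5L);
        quorum.recordEpochMessage(6L);
        System.out.println(quorum.hasQuorum()); // prints "false"
        quorum.recordEpochMessage(2L); // a real follower arrives
        System.out.println(quorum.hasQuorum()); // prints "true"
    }
}
```

In this sketch, three observers reaching the candidate leader before any follower do not let it advance; without the membership check in recordEpochMessage, the first hasQuorum() call would already return true, which corresponds to the scenario described in the thread.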