Now that we have the fix, why delay it to the next release?

On Fri, Apr 13, 2018 at 11:09 AM Abraham Fine <af...@apache.org> wrote:

> Let's wait until the next release to include this fix.
>
> On Mon, Apr 9, 2018, at 15:14, Alexander Shraer wrote:
> > Hi,
> >
> > Please take a look at the new PR for ZK-2959:
> > https://github.com/apache/zookeeper/pull/500
> > If there are no further comments, I can commit it.
> >
> > Thanks,
> > Alex
> >
> > On Fri, Apr 6, 2018 at 11:33 AM, Alexander Shraer <shra...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > The bug described in ZOOKEEPER-2959
> > > <https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that
> > > getEpochToPropose and waitForEpochAck do not distinguish between
> > > followers and observers.
> > > This can cause a candidate leader's acceptedEpoch to be updated with
> > > support only from observers. The same goes for waitForEpochAck: passing
> > > this method allows the candidate leader to update its currentEpoch. The
> > > latter helps this server win FLE elections continuously, and the former
> > > (acceptedEpoch) causes anyone trying to connect to the server to think
> > > that it has more up-to-date data and truncate their logs to match.
> > >
> > >
> > > Alex
> > >
> > > On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com>
> wrote:
> > >
> > >> Hi Alex,
> > >>
> > > >> Can you give more details about the data loss scenario in Jira
> > > >> ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>?
> > > >> As far as I know, the leader will ignore the observers' ACK in
> > > >> waitForNewLeaderAck, so it will not start serving traffic until it has
> > > >> received the actual quorum ACK. If it doesn't have enough follower
> > > >> support before the timeout, it will quit leading and its learners will
> > > >> re-sync with the new leader.
> > >>
> > >> Thanks,
> > >> Fangmin
> > >>
> > >> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com>
> > >> wrote:
> > >>
> > > >>> Btw we actually observed the described issue (data loss), thankfully
> > > >>> in a test environment. So I thought this was important to share with
> > > >>> the community.
> > > >>>
> > > >>> Unfortunately I don’t have time to run a new ZK release for this, so
> > > >>> I’m not going to -1 your candidate, but we are actively working on a
> > > >>> fix (i.e. a test at this point) and I can commit that as soon as we
> > > >>> have it.
> > > >>>
> > > >>> It may be worthwhile to delay the release by a few more days, but
> > > >>> it’s totally up to you since you’re running it.
> > >>>
> > >>> Cheers
> > >>> Alex
> > >>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com>
> wrote:
> > >>>
> > > >>> > Got that. I still believe it's a completely valid issue which has
> > > >>> > to be addressed, but it's not a showstopper. I'm afraid we're not
> > > >>> > going to convince each other, so it's probably Abe's call whether he
> > > >>> > wants to create another release candidate for the fix.
> > > >>> >
> > > >>> > I reviewed the code on GitHub and I think it just needs to be
> > > >>> > covered with a unit test to be complete.
> > >>> >
> > >>> > Regards,
> > >>> > Andor
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <
> shra...@gmail.com>
> > >>> > wrote:
> > >>> >
> > > >>> > > Yes, sort of: FLE is finished, then enough observers' messages
> > > >>> > > reach the leader before the participants' messages do.
> > > >>> > > Whether it's rare depends on the number of observers and
> > > >>> > > participants. For example, with very few participants and many
> > > >>> > > observers, your chances of hitting this are quite high.
> > >>> > >
> > >>> > > Alex
> > >>> > >
> > >>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <
> an...@cloudera.com>
> > >>> > wrote:
> > >>> > >
> > > >>> > > > Maybe I'm missing something here, but this looks like a rare
> > > >>> > > > edge case to me. Participants must finish the leader election
> > > >>> > > > successfully, and right after that enough followers must fail
> > > >>> > > > to send their epoch to the leader, so that observers can take
> > > >>> > > > over.
> > >>> > > >
> > >>> > > > Is that description accurate?
> > >>> > > >
> > >>> > > > Andor
> > >>> > > >
> > >>> > > >
> > >>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <
> > >>> shra...@gmail.com>
> > >>> > > > wrote:
> > >>> > > >
> > > >>> > > > > To clarify - in a deployment with observers this bug can
> > > >>> > > > > potentially cause data loss. A server could be elected leader
> > > >>> > > > > based just on the support of observers, even if this server's
> > > >>> > > > > data is stale with respect to other followers.
> > > >>> > > > >
> > > >>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or
> > > >>> > > > > 3.4.12.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Alex
> > >>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <
> an...@cloudera.com
> > >>> >
> > >>> > > wrote:
> > >>> > > > >
> > > >>> > > > > > I don't think it's a blocker.
> > > >>> > > > > > The Jira and PR have been open since last December and
> > > >>> > > > > > 3.4.11 was released without it.
> > > >>> > > > > >
> > > >>> > > > > > Although this bug is also important to fix, I believe it's
> > > >>> > > > > > more important to release a fix for the regression we've
> > > >>> > > > > > found in 3.4.11 asap.
> > >>> > > > > >
> > >>> > > > > > Abe, any thoughts?
> > >>> > > > > >
> > >>> > > > > > Regards,
> > >>> > > > > > Andor
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer <
> > >>> > shra...@gmail.com>
> > >>> > > > > > wrote:
> > >>> > > > > >
> > > >>> > > > > > > Sorry for coming in at the last moment. I'm not sure
> > > >>> > > > > > > when the next 3.4 release is scheduled, so just wanted to
> > > >>> > > > > > > mention this bug, which I believe is a blocker for either
> > > >>> > > > > > > this or the next release:
> > > >>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
> > >>> > > > > > >
> > >>> > > > > > > Best,
> > >>> > > > > > > Alex
> > >>> > > > > > >
> > >>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <
> yuzhih...@gmail.com>
> > >>> > > wrote:
> > >>> > > > > > >
> > >>> > > > > > > > Can the vote be closed ?
> > >>> > > > > > > >
> > >>> > > > > > > > It seems we have enough +1's
> > >>> > > > > > > >
> > >>> > > > > > > > Thanks
> > >>> > > > > > > >
> > >>> > > > > > >
> > >>> > > > > >
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
>
