Hi,

The bug described in ZOOKEEPER-2959
<https://issues.apache.org/jira/browse/ZOOKEEPER-2959> is that
getEpochToPropose and waitForEpochAck do not distinguish between followers
and observers.
This can cause a candidate leader's acceptedEpoch to be updated with
support from observers only. The same applies to waitForEpochAck: passing
this method allows the candidate leader to update its currentEpoch. The
latter (currentEpoch) helps this server win FLE elections repeatedly, and
the former (acceptedEpoch) causes anyone trying to connect to the server to
think that it has more up-to-date data and truncate their logs to match.
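
To make the distinction concrete, here is a minimal sketch of the kind of
check I mean. It is not the actual Leader code or the patch on the Jira; the
class name, the votingMembers set and the register method are made up for
illustration, and a simple majority quorum is assumed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not ZooKeeper's Leader class: counts epoch-negotiation
// support from voting members only, so observers can never complete a quorum.
class EpochQuorumSketch {
    // Assumed: server ids of the voting members (participants) of the ensemble.
    private final Set<Long> votingMembers;
    // Voting members seen so far during epoch negotiation (the leader included).
    private final Set<Long> connectingFollowers = new HashSet<>();

    EpochQuorumSketch(Set<Long> votingMembers) {
        this.votingMembers = new HashSet<>(votingMembers);
    }

    // Records a server that reported its epoch. Ids that are not voting
    // members (i.e. observers) are ignored. Returns true once a majority
    // of voting members has been recorded.
    boolean register(long sid) {
        if (votingMembers.contains(sid)) {
            connectingFollowers.add(sid);
        }
        return hasQuorum();
    }

    // Simple majority over voting members only (assumed quorum rule).
    boolean hasQuorum() {
        return connectingFollowers.size() > votingMembers.size() / 2;
    }
}
```

With voting members {1, 2, 3} and observers {4, 5, 6}, registering the
leader (1) and all three observers still leaves hasQuorum() false; it only
becomes true once a second voting member, say 2, registers.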


Alex

On Fri, Apr 6, 2018 at 10:04 AM, Fangmin Lv <lvfang...@gmail.com> wrote:

> Hi Alex,
>
> Can you give more details about the data loss scenario in Jira
> ZOOKEEPER-2959 <https://issues.apache.org/jira/browse/ZOOKEEPER-2959>? As
> far as I know, the leader ignores the observers' ACKs in
> waitForNewLeaderAck, so it will not start serving traffic until it has
> received ACKs from an actual quorum. If it doesn't have enough follower
> support before the timeout, it will quit leading and its learners will
> re-sync with the new leader.
>
> Thanks,
> Fangmin
>
> On Thu, Apr 5, 2018 at 12:57 PM, Alexander Shraer <shra...@gmail.com>
> wrote:
>
>> Btw we actually observed the described issue (data loss), thankfully in a
>> test environment. So I thought this is important to share with the
>> community.
>>
>> Unfortunately I don’t have time to run a new ZK release for this, so I’m
>> not going to -1 your candidate, but we are actively working on a fix (i.e.,
>> a test at this point) and I can commit it as soon as we have it.
>>
>> It may be worthwhile to delay the release by a few more days, but it’s
>> totally up to you since you’re running it.
>>
>> Cheers
>> Alex
>> On Thu, Apr 5, 2018 at 12:47 PM Andor Molnar <an...@cloudera.com> wrote:
>>
>> > Got that. I still believe it's a completely valid issue which has to be
>> > addressed, but it's not a showstopper. I'm afraid we're not going to
>> > convince each other, so it's probably Abe's call whether he wants to
>> > create another release candidate for the fix.
>> >
>> > I reviewed the code on github and I think it just needs to be covered
>> > with a unit test to be complete.
>> >
>> > Regards,
>> > Andor
>> >
>> >
>> >
>> > On Thu, Apr 5, 2018 at 9:05 PM, Alexander Shraer <shra...@gmail.com>
>> > wrote:
>> >
>> > > Yes, sort of: FLE is finished, then enough observers' messages reach
>> > > the leader before the participants' messages do.
>> > > Whether it's rare depends on the number of observers and participants.
>> > > For example, with very few participants and many observers
>> > > your chances of hitting this are quite high.
>> > >
>> > > Alex
>> > >
>> > > On Thu, Apr 5, 2018 at 11:44 AM, Andor Molnar <an...@cloudera.com>
>> > > wrote:
>> > >
>> > > > Maybe I'm missing something here, but this looks like a rare edge
>> > > > case to me. Participants must finish the leader election
>> > > > successfully, and right after that enough followers must fail to
>> > > > send their epoch to the leader so that observers can take over.
>> > > >
>> > > > Is that description accurate?
>> > > >
>> > > > Andor
>> > > >
>> > > >
>> > > > On Thu, Apr 5, 2018 at 7:35 PM, Alexander Shraer <shra...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > To clarify - in a deployment with observers this bug can
>> > > > > potentially cause data loss. A server could be elected leader
>> > > > > based just on the support of observers, even if this server's
>> > > > > data is stale with respect to other followers.
>> > > > >
>> > > > > It is certainly a blocker, just not sure if for 3.4.11 or 3.4.12.
>> > > > >
>> > > > >
>> > > > > Alex
>> > > > > On Thu, Apr 5, 2018 at 10:29 AM Andor Molnar <an...@cloudera.com>
>> > > > > wrote:
>> > > > >
>> > > > > > I don't think it's a blocker.
>> > > > > > The jira and PR have been open since last December and 3.4.11
>> > > > > > was released without the fix.
>> > > > > >
>> > > > > > Although this bug is also important to fix, I believe it's more
>> > > > > > important to release a fix for the regression we've found in
>> > > > > > 3.4.11 asap.
>> > > > > >
>> > > > > > Abe, any thoughts?
>> > > > > >
>> > > > > > Regards,
>> > > > > > Andor
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Apr 5, 2018 at 7:00 PM, Alexander Shraer
>> > > > > > <shra...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Sorry for coming in at the last moment. I'm not sure when the
>> > > > > > > next 3.4 release is scheduled, so I just wanted to mention
>> > > > > > > this bug, which I believe is a blocker for either this or the
>> > > > > > > next release:
>> > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-2959
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Alex
>> > > > > > >
>> > > > > > > On Thu, Apr 5, 2018 at 9:09 AM, Ted Yu <yuzhih...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Can the vote be closed ?
>> > > > > > > >
>> > > > > > > > It seems we have enough +1's
>> > > > > > > >
>> > > > > > > > Thanks
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
