Re: Partial crash bug described in Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions (FAST17)

Michael Han Thu, 02 Mar 2017 11:26:43 -0800

The partial crash bug described in the paper looks like the same case as the
one fixed by ZOOKEEPER-2247. The root cause is the same in both cases (the
quorum threads were not shut down).
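
As a rough illustration of this failure mode, here is a toy Python sketch. It
is not ZooKeeper code: the Node class and all of its method names are
hypothetical, and the two threads only stand in for the transaction processing
and quorum threads mentioned in the paper. It shows how stopping one thread
but not the other leaves a node that still looks alive to its peers yet cannot
do any work.

```python
import threading


class Node:
    """Toy node (hypothetical): one 'processor' thread, one 'quorum' thread."""

    def __init__(self):
        self.stop_processor = threading.Event()
        self.stop_quorum = threading.Event()
        self.processor = threading.Thread(
            target=self._run, args=(self.stop_processor,), daemon=True)
        self.quorum = threading.Thread(
            target=self._run, args=(self.stop_quorum,), daemon=True)

    @staticmethod
    def _run(stop):
        # Stand-in for real work: idle until asked to stop.
        stop.wait()

    def start(self):
        self.processor.start()
        self.quorum.start()

    def shutdown_partial(self):
        # Buggy path: only the transaction processing thread is stopped.
        self.stop_processor.set()
        self.processor.join()

    def shutdown_full(self):
        # Fixed path: stop the quorum thread as well.
        self.shutdown_partial()
        self.stop_quorum.set()
        self.quorum.join()

    def looks_alive_to_peers(self):
        # Assumption: peers judge liveness only by the quorum thread.
        return self.quorum.is_alive()

    def can_propose(self):
        return self.processor.is_alive()


def demo():
    node = Node()
    node.start()
    node.shutdown_partial()
    partial = (node.looks_alive_to_peers(), node.can_propose())
    node.shutdown_full()
    full = (node.looks_alive_to_peers(), node.can_propose())
    return partial, full


if __name__ == "__main__":
    print(demo())
```

After shutdown_partial() the node still looks alive to peers but cannot
propose; after shutdown_full() it looks dead, so in a real ensemble the other
nodes would be free to elect a new leader.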


On Thu, Mar 2, 2017 at 7:45 AM, Rakesh Radhakrishnan <[email protected]>
wrote:

> Thanks a lot, Andrew Purtell, for pointing this out.
>
> I can see that https://issues.apache.org/jira/browse/ZOOKEEPER-2247
> describes a similar case. Could you please go through this JIRA and let
> me know your comments?
>
> It seems they used ZooKeeper v3.4.8 to prepare the report. The fix for
> this bug is available only in the latest stable version, 3.4.9.
>
> Thanks,
> Rakesh
>
> On Thu, Mar 2, 2017 at 11:07 AM, Andrew Purtell <[email protected]>
> wrote:
>
> > Is there a JIRA open for the partial crash bug described in "Redundancy
> > Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions
> > to Single Errors and Corruptions" Aishwarya Ganesan, Ramnatthan Alagappan,
> > Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of
> > Wisconsin—Madison. 15th USENIX Conference on File and Storage Technologies
> > (FAST ’17)?
> >
> > From
> > https://www.usenix.org/system/files/conference/fast17/fast17-ganesan.pdf
> >
> >
> > "Unfortunately, ZooKeeper does not recover from write errors to the
> > transaction head and log tail. On write errors during log initialization,
> > the error handling code tries to gracefully shutdown the node but kills
> > only the transaction processing threads; the quorum thread remains alive
> > (partial crash). Consequently, other nodes believe that the leader is
> > healthy and do not elect a new leader. However, since the leader has
> > partially crashed, it cannot propose any transactions, leading to an
> > indefinite write unavailability."
> >
> >
> >
> >
> > --
> > Best regards,
> > Andrew Purtell
> > [email protected]
> > [email protected]
> >
>



-- 
Cheers
Michael.
