The partial crash bug described in the paper looks like the same case as the one fixed by ZOOKEEPER-2247. The root cause is the same in both cases: the quorum threads were not shut down.
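To make "partial crash" concrete, here is a toy sketch of the failure mode. It is not ZooKeeper code: the class `ToyNode` and all of its method names are hypothetical, and the model assumes that peers judge liveness only by whether the quorum thread is still running.

```python
import threading


class ToyNode:
    """Toy model of a server with two background threads (not ZooKeeper code)."""

    def __init__(self):
        self._stop_processor = threading.Event()
        self._stop_quorum = threading.Event()
        self.processor = threading.Thread(
            target=self._run, args=(self._stop_processor,), daemon=True)
        self.quorum = threading.Thread(
            target=self._run, args=(self._stop_quorum,), daemon=True)

    @staticmethod
    def _run(stop_event):
        # Stand-in for real work: idle until asked to stop.
        stop_event.wait()

    def start(self):
        self.processor.start()
        self.quorum.start()

    def partial_shutdown(self):
        # Failure mode: only the transaction-processing side is stopped.
        self._stop_processor.set()
        self.processor.join()

    def full_shutdown(self):
        # Intended behaviour: stop the quorum thread as well.
        self._stop_processor.set()
        self._stop_quorum.set()
        self.processor.join()
        self.quorum.join()

    def looks_alive_to_peers(self):
        # Assumption: peers only observe the quorum thread.
        return self.quorum.is_alive()

    def can_process_writes(self):
        return self.processor.is_alive()


if __name__ == "__main__":
    node = ToyNode()
    node.start()
    node.partial_shutdown()
    # Peers still see a live leader, but it can no longer handle writes.
    print(node.looks_alive_to_peers(), node.can_process_writes())  # True False
    node.full_shutdown()
    print(node.looks_alive_to_peers(), node.can_process_writes())  # False False
```

After `partial_shutdown()` the node still looks healthy to its peers, so in this model nothing would trigger a new election even though no write can be processed.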

On Thu, Mar 2, 2017 at 7:45 AM, Rakesh Radhakrishnan <[email protected]> wrote:
> Thanks a lot Andrew Purtell for pointing out this.
>
> I could see, https://issues.apache.org/jira/browse/ZOOKEEPER-2247 jira is
> talking about similar case. Could you please go through this jira and let
> me know your comments.
>
> It seems they have used ZooKeeper (v3.4.8) for preparing the report. This
> bug is fixed and available only in the latest stable version 3.4.9.
>
> Thanks,
> Rakesh
>
> On Thu, Mar 2, 2017 at 11:07 AM, Andrew Purtell <[email protected]>
> wrote:
>
> > Is there a JIRA open for the partial crash bug described in "Redundancy
> > Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions
> > to Single Errors and Corruptions" Aishwarya Ganesan, Ramnatthan Alagappan,
> > Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of
> > Wisconsin—Madison. 15th USENIX Conference on File and Storage Technologies
> > (FAST ’17)?
> >
> > From
> > https://www.usenix.org/system/files/conference/fast17/fast17-ganesan.pdf
> >
> > "Unfortunately, ZooKeeper does not recover from write errors to the
> > transaction head and log tail. On write errors during log initialization,
> > the error handling code tries to gracefully shutdown the node but kills
> > only the transaction processing threads; the quorum thread remains alive
> > (partial crash). Consequently, other nodes believe that the leader is
> > healthy and do not elect a new leader. However, since the leader has
> > partially crashed, it cannot propose any transactions, leading to an
> > indefinite write unavailability."
> >
> > --
> > Best regards,
> > Andrew Purtell
> > [email protected]
> > [email protected]

--
Cheers
Michael.
