[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399853#comment-13399853
 ] 

Marshall McMullen commented on ZOOKEEPER-1453:
----------------------------------------------

Patrick, thanks for taking the time to explain. I read through the other bug and 
your explanation is very clear. I'd like to work on a fix for this, as it's 
hitting us very frequently in a stress test where we continually reboot one of 
the machines hosting one of our zk servers. Anyhow, I'm looking at the 
FileTxnIterator code, and I definitely see the bug in the next() method: it 
always treats EOF as success. Have you given thought to the right solution 
here? Maybe giving precedence to validating the CRC before checking for EOF? 

What do you think about this:

public boolean next() throws IOException {
    if (ia == null) {
        return false;
    }
    try {
        long crcValue = ia.readLong("crcvalue");
        byte[] bytes = Util.readTxnBytes(ia);
        // validate CRC first, so a damaged record is reported as a CRC
        // error rather than being mistaken for end of file
        Checksum crc = makeChecksumAlgorithm();
        if (bytes != null) {
            crc.update(bytes, 0, bytes.length);
        }
        if (crcValue != crc.getValue())
            throw new IOException(CRC_ERROR);
        // only after the CRC checks out is an empty record treated as EOF
        if (bytes == null || bytes.length == 0)
            throw new EOFException("Failed to read " + logFile);
        hdr = new TxnHeader();
        record = SerializeUtils.deserializeTxn(bytes, hdr);
    } catch (EOFException e) {
    ...
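
To make the distinction concrete, here is a rough standalone sketch of the 
idea (not the ZooKeeper code). It assumes a made-up record layout of an 
8-byte checksum, a 4-byte length, then the payload; the class TxnLogSketch, 
its methods, and the MAX_RECORD bound are all hypothetical, and Adler32 is 
used only as a convenient java.util.zip.Checksum. The point is that hitting 
EOF exactly on a record boundary is a clean end of log, while hitting EOF 
partway through a record, or a checksum mismatch, is reported as corruption.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Hypothetical, simplified record layout (NOT ZooKeeper's on-disk format):
//   8-byte checksum | 4-byte payload length | payload bytes
class TxnLogSketch {
    static final int MAX_RECORD = 1 << 20; // arbitrary sanity bound for this sketch

    static long checksum(byte[] payload) {
        Checksum crc = new Adler32(); // assumption: any Checksum would do here
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    /** Serializes one record in the simplified layout (big-endian). */
    static byte[] encode(byte[] payload) {
        return ByteBuffer.allocate(12 + payload.length)
                .putLong(checksum(payload))
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    /**
     * Returns the next payload, or null only if the stream ends exactly on a
     * record boundary. EOF inside a record or a checksum mismatch throws.
     */
    static byte[] readNext(DataInputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            return null; // clean end of log
        }
        try {
            // Rebuild the big-endian checksum, starting with the byte already read.
            long stored = first;
            for (int i = 0; i < 7; i++) {
                stored = (stored << 8) | in.readUnsignedByte();
            }
            int len = in.readInt();
            if (len < 0 || len > MAX_RECORD) {
                throw new IOException("implausible record length: " + len);
            }
            byte[] payload = new byte[len];
            in.readFully(payload);
            if (stored != checksum(payload)) {
                throw new IOException("checksum mismatch");
            }
            return payload;
        } catch (EOFException e) {
            // The record started but did not finish: corruption, not end of log.
            throw new IOException("truncated record", e);
        }
    }

    /** True if every record in the log reads back cleanly. */
    static boolean isIntact(byte[] log) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
            while (readNext(in) != null) {
                // keep reading until a clean end of log
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] rec = encode("txn".getBytes(StandardCharsets.UTF_8));
        byte[] cut = Arrays.copyOf(rec, rec.length - 1);
        System.out.println("intact log ok:    " + isIntact(rec));
        System.out.println("truncated log ok: " + isIntact(cut));
    }
}
```

A real fix would also have to cope with whatever the actual log format does 
at the end of a file, which this sketch does not attempt to model.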
                
> corrupted logs may not be correctly identified by FileTxnIterator
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1453
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1453
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Priority: Critical
>
> See ZOOKEEPER-1449 for background on this issue. The main problem is that 
> during server recovery 
> org.apache.zookeeper.server.persistence.FileTxnLog.FileTxnIterator.next() 
> does not indicate if the available logs are valid or not. In some cases (say 
> a truncated record and a single txnlog in the datadir) we will not detect 
> that the file is corrupt, vs reaching the end of the file.


        
