[ https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616950#comment-15616950 ]

Michael Han commented on ZOOKEEPER-1621:
----------------------------------------

The proposed fix makes sense to me.

Is it feasible to make a stronger guarantee for ZooKeeper's serialization 
semantics - that is, under no circumstances (disk full, power failure, 
hardware failure) would ZooKeeper generate invalid persistent files (for 
both snapshots and transaction logs)? This might be possible by serializing 
to a swap file first and then, at some point, doing an atomic rename of the 
file. With the sanity of the on-disk format guaranteed, the deserialization 
logic would be simplified, as there would not be many corner cases to 
consider beyond the existing basic checksum check logic.
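
A minimal sketch of the swap-file-plus-atomic-rename idea (the class and 
method names here are illustrative only, not the actual FileSnap / 
FileTxnLog API):

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class AtomicFileWriter {
    /**
     * Write the bytes to a swap file in the same directory, force them to
     * disk, and then atomically rename the swap file over the target. A
     * reader only ever sees either the old complete file or the new
     * complete file, never a partially written one.
     */
    public static void writeAtomically(File target, byte[] data) throws IOException {
        File swap = new File(target.getParentFile(), target.getName() + ".tmp");
        try (FileOutputStream fos = new FileOutputStream(swap)) {
            fos.write(data);
            fos.getChannel().force(true); // flush to disk before publishing
        }
        Files.move(swap.toPath(), target.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}

With this, a disk-full or crash failure hits the swap file before the 
rename, so the previously published file stays intact.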

I can think of two potential drawbacks to this approach:
* Performance: if we write to a swap file and then rename it for every 
write, we will be making more syscalls per write, which might hurt write 
performance / latency.
* Potential data loss during recovery: to improve performance, we could 
batch writes and only rename at certain points (e.g. every 1000 writes). In 
case of a failure, part of the data might be lost, since the data (possibly 
corrupted / partially serialized) still living in the swap file will not be 
parsed by ZK during startup (we will only load and parse renamed files).

My feeling is that the best approach might be a mix of efforts on both the 
serialization and deserialization sides:
* When serializing, we make a best effort to avoid generating corrupted 
files (e.g. through atomic writes to files).
* When deserializing, we make a best effort to detect corrupt files and 
recover conservatively (see the sketch after this list) - the success of 
recovery might be case by case. For example, for this disk-full case the 
proposed fix sounds pretty safe to perform, while in other cases it might 
not be straightforward to tell which data is good and which is bad.
* As a result, the expectation is that when things crash and files get 
corrupted, ZK should be able to recover later without manual intervention. 
This would be good for users.
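
To make the "recover conservatively" part concrete, here is a rough sketch 
of a checksum-guarded reader. The record layout used here (an int length, 
an Adler32 checksum, then the payload) is hypothetical and not the real 
FileTxnLog format; it only illustrates loading everything up to the first 
truncated or corrupted record instead of refusing to start.

{code:java}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Adler32;

class ConservativeLogReader {
    // Hypothetical record format: [int length][long adler32][payload bytes].
    static List<byte[]> readValidRecords(String path) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            while (true) {
                int len;
                long expected;
                byte[] payload;
                try {
                    len = in.readInt();
                    expected = in.readLong();
                    if (len < 0) {
                        break;      // nonsensical length, treat as corruption
                    }
                    payload = new byte[len];
                    in.readFully(payload);
                } catch (EOFException partialTail) {
                    break;          // truncated tail, e.g. after a disk-full crash
                }
                Adler32 crc = new Adler32();
                crc.update(payload, 0, payload.length);
                if (crc.getValue() != expected) {
                    break;          // corrupted record: stop here, keep what we have
                }
                records.add(payload);
            }
        }
        return records;
    }
}
{code}

Everything before the first bad record is kept, the rest is dropped, and 
the server can start without manual cleanup of the partial file.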


> ZooKeeper does not recover from crash when disk was full
> --------------------------------------------------------
>
>                 Key: ZOOKEEPER-1621
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.3
>         Environment: Ubuntu 12.04, Amazon EC2 instance
>            Reporter: David Arthur
>            Assignee: Michi Mutsuzaki
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-1621.patch, zookeeper.log.gz
>
>
> The disk that ZooKeeper was using filled up. During a snapshot write, I got 
> the following exception
> 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
> Severe unrecoverable error, exiting
> java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:282)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
>         at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
>         at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> Then many subsequent exceptions like:
> 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was 
> partial.
> 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
> exception, exiting abnormally
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
>         at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
>         at 
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>         at 
> org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
>         at 
> org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
>         at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
>         at 
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
>         at 
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
>         at 
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> It seems to me that writing the transaction log should be fully atomic to 
> avoid such situations. Is this not the case?


