[jira] [Commented] (ZOOKEEPER-1936) Server exits when unable to create data directory due to race

2016-07-28 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397513#comment-15397513
 ] 

David Arthur commented on ZOOKEEPER-1936:
-

Are there any known scenarios that will trigger this race? We've seen it 
intermittently in an EC2 environment, but have yet to figure out why it happens 
there and not other environments. 

Also, any updates on the status of this issue?

> Server exits when unable to create data directory due to race 
> --
>
> Key: ZOOKEEPER-1936
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1936
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.6, 3.5.0
>Reporter: Harald Musum
>Assignee: Andrew Purtell
>Priority: Minor
> Attachments: ZOOKEEPER-1936.patch, ZOOKEEPER-1936.v2.patch, 
> ZOOKEEPER-1936.v3.patch, ZOOKEEPER-1936.v3.patch, ZOOKEEPER-1936.v4.patch
>
>
> We sometime see issues with ZooKeeper server not starting and seeing this 
> error in the log:
> [2014-05-27 09:29:48.248] ERROR   : -   
> .org.apache.zookeeper.server.ZooKeeperServerMainUnexpected exception,
> exiting abnormally\nexception=\njava.io.IOException: Unable to create data
> directory /home/y/var/zookeeper/version-2\n\tat
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.(FileTxnSnapLog.java:85)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)\n\t
> [...]
> Stack trace from JVM gives this:
> "PurgeTask" daemon prio=10 tid=0x0201d000 nid=0x1727 runnable
> [0x7f55d7dc7000]
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createDirectory(Native Method)
> at java.io.File.mkdir(File.java:1310)
> at java.io.File.mkdirs(File.java:1337)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.(FileTxnSnapLog.java:84)
> at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:68)
> at
> org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:140)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> "zookeeper server" prio=10 tid=0x027df800 nid=0x1715 runnable
> [0x7f55d7ed8000]
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createDirectory(Native Method)
> at java.io.File.mkdir(File.java:1310)
> at java.io.File.mkdirs(File.java:1337)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.(FileTxnSnapLog.java:84)
> at
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)
> at
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
> at
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
> at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
> at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> [...]
> So it seems that when autopurge is used (as it is in our case), it might 
> happen at the same time as starting the server itself. In FileTxnSnapLog() it 
> will check if the directory exists and create it if not. These two tasks do 
> this at the same time, and mkdir fails and server exits the JVM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full

2015-02-27 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340429#comment-14340429
 ] 

David Arthur commented on ZOOKEEPER-1621:
-

I actually like [~shralex]'s suggestion. However, if this is going to be the 
way you recommended recovering a corrupt log file, there should be a script 
that does it for users: zk-recover.sh or some such. In this world of deployment 
automation, it's not a nice thing to say go delete the most recent log segment 
from ZK's data dir. Much better for the application to handle it through a 
script or command.

 ZooKeeper does not recover from crash when disk was full
 

 Key: ZOOKEEPER-1621
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.3
 Environment: Ubuntu 12.04, Amazon EC2 instance
Reporter: David Arthur
Assignee: Michi Mutsuzaki
 Fix For: 3.5.1

 Attachments: ZOOKEEPER-1621.patch, zookeeper.log.gz


 The disk that ZooKeeper was using filled up. During a snapshot write, I got 
 the following exception
 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
 Severe unrecoverable error, exiting
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:282)
 at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
 at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
 Then many subsequent exceptions like:
 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was 
 partial.
 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
 exception, exiting abnormally
 java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at 
 org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
 at 
 org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
 at 
 org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
 It seems to me that writing the transaction log should be fully atomic to 
 avoid such situations. Is this not the case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full

2013-01-16 Thread David Arthur (JIRA)
David Arthur created ZOOKEEPER-1621:
---

 Summary: ZooKeeper does not recover from crash when disk was full
 Key: ZOOKEEPER-1621
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.3
 Environment: Ubuntu 12.04, Amazon EC2 instance
Reporter: David Arthur


The disk that ZooKeeper was using filled up. During a snapshot write, I got the 
following exception

2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
Severe unrecoverable error, exiting
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at 
org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
at 
org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
at 
org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)

Then many subsequent exceptions like:

2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was partial.
2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
exception, exiting abnormally
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at 
org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504)
at 
org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at 
org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
at 
org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)


It seems to me that writing the transaction log should be fully atomic to avoid 
such situations. Is this not the case?



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full

2013-01-16 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555110#comment-13555110
 ] 

David Arthur commented on ZOOKEEPER-1621:
-

I was able to workaround the issue by deleting the partially written snapshot 
file

 ZooKeeper does not recover from crash when disk was full
 

 Key: ZOOKEEPER-1621
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.3
 Environment: Ubuntu 12.04, Amazon EC2 instance
Reporter: David Arthur

 The disk that ZooKeeper was using filled up. During a snapshot write, I got 
 the following exception
 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
 Severe unrecoverable error, exiting
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:282)
 at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
 at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
 Then many subsequent exceptions like:
 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was 
 partial.
 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
 exception, exiting abnormally
 java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at 
 org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
 at 
 org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
 at 
 org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
 It seems to me that writing the transaction log should be fully atomic to 
 avoid such situations. Is this not the case?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full

2013-01-16 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555215#comment-13555215
 ] 

David Arthur commented on ZOOKEEPER-1621:
-

Here is the full sequence of events (sorry for the confusion):

* Noticed disk was full
* Cleaned up disk space
* Tried zkCli.sh, got errors
* Checked ZK log, loop of:

2013-01-16 15:01:35,194 - ERROR [main:Util@239] - Last transaction was partial.
2013-01-16 15:01:35,196 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
exception, exiting abnormally
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at 
org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504)
at 
org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at 
org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
at 
org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
at 
org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

* Stopped ZK
* Listed ZK data directory

ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ ls -lat
total 18096
drwxr-xr-x 2 zookeeper zookeeper 4096 Jan 16 06:41 .
-rw-r--r-- 1 zookeeper zookeeper0 Jan 16 06:41 log.19a3e
-rw-r--r-- 1 zookeeper zookeeper   585377 Jan 16 06:41 snapshot.19a3d
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.19a2a
-rw-r--r-- 1 zookeeper zookeeper   585911 Jan 16 03:11 snapshot.19a29
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 16 03:11 log.11549
-rw-r--r-- 1 zookeeper zookeeper   585190 Jan 15 17:28 snapshot.11547
-rw-r--r-- 1 zookeeper zookeeper 67108880 Jan 15 17:28 log.1
-rw-r--r-- 1 zookeeper zookeeper  296 Jan 14 16:44 snapshot.0
drwxr-xr-x 3 zookeeper zookeeper 4096 Jan 14 16:44 ..

* Removed log.19a3e and snapshot.19a3d

ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm log.19a3e
ubuntu@ip-10-78-19-254:/opt/zookeeper-3.4.3/data/version-2$ sudo rm 
snapshot.19a3d

* Started ZK
* Back to normal

 ZooKeeper does not recover from crash when disk was full
 

 Key: ZOOKEEPER-1621
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.3
 Environment: Ubuntu 12.04, Amazon EC2 instance
Reporter: David Arthur
 Fix For: 3.5.0

 Attachments: zookeeper.log.gz


 The disk that ZooKeeper was using filled up. During a snapshot write, I got 
 the following exception
 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
 Severe unrecoverable error, exiting
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:282)
 at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
 at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
 at 
 

[jira] [Updated] (ZOOKEEPER-1621) ZooKeeper does not recover from crash when disk was full

2013-01-16 Thread David Arthur (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur updated ZOOKEEPER-1621:


Attachment: zookeeper.log.gz

Attaching zookeeper.log

 ZooKeeper does not recover from crash when disk was full
 

 Key: ZOOKEEPER-1621
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1621
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.3
 Environment: Ubuntu 12.04, Amazon EC2 instance
Reporter: David Arthur
 Fix For: 3.5.0

 Attachments: zookeeper.log.gz


 The disk that ZooKeeper was using filled up. During a snapshot write, I got 
 the following exception
 2013-01-16 03:11:14,098 - ERROR [SyncThread:0:SyncRequestProcessor@151] - 
 Severe unrecoverable error, exiting
 java.io.IOException: No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:282)
 at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:309)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:306)
 at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:162)
 at 
 org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
 Then many subsequent exceptions like:
 2013-01-16 15:02:23,984 - ERROR [main:Util@239] - Last transaction was 
 partial.
 2013-01-16 15:02:23,985 - ERROR [main:ZooKeeperServerMain@63] - Unexpected 
 exception, exiting abnormally
 java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at 
 org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
 at 
 org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:504)
 at 
 org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
 at 
 org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
 at 
 org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:259)
 at 
 org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:386)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:138)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:112)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
 at 
 org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
 at 
 org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
 It seems to me that writing the transaction log should be fully atomic to 
 avoid such situations. Is this not the case?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira