[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505015#comment-15505015
 ] 

Chris Nauroth commented on ZOOKEEPER-1936:
------------------------------------------

This patch has been stalled, because there is no easy way to write a unit test 
for it, and no one has been able to produce a consistent repro in a live 
environment.

I have good news.  I found an environment that reproduces the problem 
consistently, in approximately 80% of ZooKeeper server starts.  FWIW, the OS 
was SUSE11sp3, and it was running ZooKeeper 3.4.6.  After I applied the v3 
patch and deployed it in this environment, we could no longer reproduce the 
problem.

Based on successful manual testing, I am now +1 to commit patch v4 to trunk and 
branch-3.5, and commit patch v3 to branch-3.4.  I will wait until later in the 
week in case other committers who have been watching the issue would like to 
discuss further.

> Server exits when unable to create data directory due to race 
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1936
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1936
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6, 3.5.0
>            Reporter: Harald Musum
>            Assignee: Andrew Purtell
>            Priority: Minor
>         Attachments: ZOOKEEPER-1936.patch, ZOOKEEPER-1936.v2.patch, 
> ZOOKEEPER-1936.v3.patch, ZOOKEEPER-1936.v3.patch, ZOOKEEPER-1936.v4.patch
>
>
> We sometimes see the ZooKeeper server fail to start, with this error in the 
> log:
> [2014-05-27 09:29:48.248] ERROR   : -               
> .org.apache.zookeeper.server.ZooKeeperServerMain    Unexpected exception,
> exiting abnormally\nexception=\njava.io.IOException: Unable to create data
> directory /home/y/var/zookeeper/version-2\n\tat
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:85)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)\n\tat
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)\n\tat
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)\n\t
> [...]
> Stack trace from JVM gives this:
> "PurgeTask" daemon prio=10 tid=0x000000000201d000 nid=0x1727 runnable
> [0x00007f55d7dc7000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.UnixFileSystem.createDirectory(Native Method)
>     at java.io.File.mkdir(File.java:1310)
>     at java.io.File.mkdirs(File.java:1337)
>     at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84)
>     at org.apache.zookeeper.server.PurgeTxnLog.purge(PurgeTxnLog.java:68)
>     at
> org.apache.zookeeper.server.DatadirCleanupManager$PurgeTask.run(DatadirCleanupManager.java:140)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
> "zookeeper server" prio=10 tid=0x00000000027df800 nid=0x1715 runnable
> [0x00007f55d7ed8000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.UnixFileSystem.createDirectory(Native Method)
>     at java.io.File.mkdir(File.java:1310)
>     at java.io.File.mkdirs(File.java:1337)
>     at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:84)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:103)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
>     at
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
>     at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
>     at
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> [...]
> So it seems that when autopurge is used (as it is in our case), the purge 
> task might run at the same time as the server itself starts. The 
> FileTxnSnapLog() constructor checks whether the directory exists and creates 
> it if not. When both threads do this at the same time, one mkdir call fails 
> and the server exits the JVM.
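
The race described above can be illustrated with a minimal sketch. This is not
the attached patch; the class name DataDirSketch and the method
ensureDirectory are hypothetical and used only for illustration. The idea,
under the assumption that losing the race is harmless as long as the directory
ends up existing, is to treat a false return from File.mkdirs() as an error
only if the path is still not a directory afterwards.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical illustration only; not the code from the ZOOKEEPER-1936 patches.
public class DataDirSketch {

    // Create dir (and any missing parents). mkdirs() returns false when
    // another thread created the directory first, so re-check the final
    // state before treating that return value as a failure.
    static void ensureDirectory(File dir) throws IOException {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Unable to create data directory " + dir);
        }
    }

    public static void main(String[] args) throws Exception {
        File base = Files.createTempDirectory("zk-sketch").toFile();
        File target = new File(base, "version-2");

        // Two threads race to create the same directory, mirroring the
        // "PurgeTask" and "zookeeper server" threads in the stack trace.
        Runnable task = () -> {
            try {
                ensureDirectory(target);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        Thread purge = new Thread(task, "PurgeTask");
        Thread server = new Thread(task, "zookeeper server");
        purge.start();
        server.start();
        purge.join();
        server.join();

        System.out.println("directory exists: " + target.isDirectory());
    }
}
```

With this shape, whichever thread loses the race sees that the directory is
already present and continues instead of throwing. A real failure, such as a
regular file in the way or missing permissions, still raises IOException.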



