[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054239#comment-13054239
 ] 

Laxman commented on ZOOKEEPER-1109:
-----------------------------------

Reposting the comments and analysis.

I have also gone through Ted's earlier response on the disk-full scenario:
http://www.google.co.in/url?sa=t&source=web&cd=3&ved=0CCAQFjAC&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fzookeeper-user%2F201106.mbox%2F%253CBANLkTimzQjXZvDKnP6xQLF9jHfsaz6JstA%40mail.gmail.com%253E&ei=FBQETvPWIcLNrQfk75yaDA&usg=AFQjCNFTkguyxTligpz1TZBmkqe9Osz-uw

We feel that even when the disk of one cluster member is full, service for 
the entire cluster should not be interrupted.

*Thread dumps*

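The following thread dump shows the QuorumPeerMain shutdown hook thread 
(Thread-2) waiting indefinitely inside SyncRequestProcessor.shutdown(), while 
SyncThread:2 waits inside System.exit() for that hook to finish.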

{noformat}
"Thread-2" prio=10 tid=0x0810a400 nid=0x1695 in Object.wait() [0xac783000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
        at java.lang.Thread.join(Thread.java:1143)
        - locked <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:171)
        at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:79)
        at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:513)
        at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:413)
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:411)
        at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)

"SyncThread:2" prio=10 tid=0xad7fd800 nid=0x4acb in Object.wait() [0xac9ba000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
        at java.lang.Thread.join(Thread.java:1143)
        - locked <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
        at java.lang.Thread.join(Thread.java:1196)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
        at java.lang.Shutdown.runHooks(Shutdown.java:79)
        at java.lang.Shutdown.sequence(Shutdown.java:123)
        at java.lang.Shutdown.exit(Shutdown.java:168)
        - locked <0xf01ff3b0> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:90)
        at java.lang.System.exit(System.java:904)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:149)
{noformat}
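
The cyclic wait in this dump can be reproduced outside ZooKeeper. The sketch below is only an illustration, not ZooKeeper code: the class name CyclicJoinSketch, the method name and the thread names are made up, and the hook's join is given a timeout so the demo terminates instead of hanging. The "sync" thread plays the role of the thread inside System.exit(), which starts the hook and waits for it; the "hook" thread plays the role of the shutdown hook, which joins the sync thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CyclicJoinSketch {

    /**
     * Simulates the two waits from the thread dump with a bounded join.
     * Returns true if the "hook" thread gave up waiting while the "sync"
     * thread was still alive, i.e. an unbounded join would never return.
     */
    static boolean hookTimesOutWaitingForSync(long timeoutMs) throws InterruptedException {
        AtomicBoolean syncStillAlive = new AtomicBoolean(false);
        Thread[] sync = new Thread[1];

        // Stand-in for the shutdown hook: its shutdown path joins the sync thread.
        Thread hook = new Thread(() -> {
            try {
                sync[0].join(timeoutMs); // the reported code joins with no timeout
                syncStillAlive.set(sync[0].isAlive());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hook");

        // Stand-in for the sync thread: like System.exit(), it starts the hook
        // and then waits for the hook to finish.
        sync[0] = new Thread(() -> {
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sync");

        sync[0].start();
        sync[0].join();
        return syncStillAlive.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("hook timed out while sync was alive: "
                + hookTimesOutWaitingForSync(200));
    }
}
```

With the timeout removed from the hook's join, each thread would wait for the other forever, which matches the two WAITING states above.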


*Logs*

{noformat}
2011-06-21 10:09:59,730 - FATAL [SyncThread:2:SyncRequestProcessor@148] - Severe unrecoverable error, exiting
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:305)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:324)
        at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:158)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)
2011-06-21 10:09:59,732 - INFO  [Thread-2:QuorumPeer@691] - The Quorum server is going for shutdown
2011-06-21 10:09:59,732 - INFO  [Thread-2:Leader@393] - Shutdown called
java.lang.Exception: shutdown Leader! reason: quorum Peer shutdown
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:393)
        at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)
2011-06-21 10:09:59,733 - INFO  [Thread-6:Leader$LearnerCnxAcceptor@243] - exception while shutting down acceptor: java.net.SocketException: Socket closed
2011-06-21 10:09:59,758 - INFO  [ProcessThread:-1:PrepRequestProcessor@120] - PrepRequestProcessor exited loop!
2011-06-21 10:09:59,758 - INFO  [CommitProcessor:2:CommitProcessor@150] - CommitProcessor exited loop!
2011-06-21 10:09:59,759 - INFO  [Thread-2:FinalRequestProcessor@379] - shutdown of request processor complete
2011-06-21 10:10:00,000 - INFO  [SessionTracker:SessionTrackerImpl@165] - SessionTrackerImpl exited loop!
{noformat}


> Zookeeper service is down when SyncRequestProcessor meets any exception.
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1109
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1109
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3
>            Reporter: Laxman
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Problem* ZooKeeper does not shut down completely when the dataDir disk is 
> full, and the ZK cluster goes into an unserviceable state.
>  
> *Scenario*
> If the leader's disk becomes full, ZooKeeper tries to shut down, but it 
> waits indefinitely while shutting down the SyncRequestProcessor thread.
> *Root Cause* 
> this.join() is invoked on the same thread that triggered System.exit(11).
> When the disk fills up, the SyncRequestProcessor thread gets the exception 
> 'No space left on device' and invokes System.exit(11) (the logs show the 
> same). Before the JVM exits, it runs the QuorumPeerMain shutdown hook, and 
> the flow reaches SyncRequestProcessor.shutdown(). There, this.join() waits 
> for the very thread that invoked System.exit(11), which is itself waiting 
> for the shutdown hook to finish.
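
One possible way to break this wait is to have the processor record that it has entered its exit path, so that shutdown() can skip the join. The sketch below is a simplified model under that assumption and is not presented as the project's fix: the Processor class, the running flag and the joinSkipped helper are hypothetical, and System.exit() is simulated by starting a "hook" thread and waiting for it, so the example can run without killing the JVM.

```java
public class RunningFlagSketch {

    /** Hypothetical stand-in for a processor thread; not ZooKeeper's class. */
    static class Processor extends Thread {
        private volatile boolean running = true;
        private volatile boolean joinSkipped = false;

        @Override
        public void run() {
            // Simulate the fatal-error path: mark this thread as exiting, then do
            // what System.exit() would do, i.e. run a hook and wait for it.
            running = false;
            Thread hook = new Thread(this::shutdown, "hook");
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Joins the processor only if it has not already entered its exit path. */
        void shutdown() {
            if (!running) {
                joinSkipped = true; // the processor is exiting and is waiting on us
                return;
            }
            try {
                join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        boolean joinSkipped() {
            return joinSkipped;
        }
    }

    /** Returns true if shutdown() skipped the join and the processor terminated. */
    static boolean shutsDownCleanly() throws InterruptedException {
        Processor p = new Processor();
        p.start();
        p.join(2000); // bounded so the sketch itself cannot hang
        return p.joinSkipped() && !p.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("clean shutdown: " + shutsDownCleanly());
    }
}
```

The flag has to be set before the exit call, otherwise the hook could still observe running == true and block in join().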

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
