Damien Raude-Morvan created STORM-307:
-----------------------------------------

             Summary: After host crash, supervisor is unable to restart itself
                 Key: STORM-307
                 URL: https://issues.apache.org/jira/browse/STORM-307
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.1-incubating
         Environment: Debian Linux Wheezy
Zookeeper 3.3.3
Java 1.7.0_25
            Reporter: Damien Raude-Morvan


Hi,

I've observed [multiple times|#links] that supervisor state de-serialisation 
after a host crash or reboot can fail. The supervisor is then unable to come up 
without manual intervention. AFAICT, the serialized supervisor state 
is invalid and can't be read at the next start.

Observed error in the supervisor log:
{noformat}
2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, 
connectString=127.0.0.1:2181/storm sessionTimeout=20000 
watcher=com.netflix.curator.ConnectionState@18d055e0
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server 
/127.0.0.1:2181
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to 
localhost/127.0.0.1:2181, initiating session
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on 
server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated 
timeout = 20000
2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 
71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
        at backtype.storm.utils.Utils.deserialize(Utils.java:86) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.get(LocalState.java:56) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at 
backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
        at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
        at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) 
~[clojure-1.4.0.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
        at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) 
~[na:na]
        at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: java.io.EOFException: null
        at 
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
 ~[na:1.7.0_25]
        at 
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
 ~[na:1.7.0_25]
        at 
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) 
~[na:1.7.0_25]
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) 
~[na:1.7.0_25]
        at backtype.storm.utils.Utils.deserialize(Utils.java:81) 
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        ... 11 common frames omitted
2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an 
event")
{noformat}
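
For illustration only: the trace above fails inside ObjectInputStream.readStreamHeader, which is the step that reads the 4-byte serialization header. That is consistent with (but does not prove) a state file that was left empty or truncated to fewer than 4 bytes by the crash. The following minimal sketch reproduces the same exception type with plain JDK classes. The class and method names (EmptyStateDemo, headerReadFails, serialize) are hypothetical and are not part of Storm.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical demo class, not Storm code.
class EmptyStateDemo {

    // Returns true if opening an ObjectInputStream over the given bytes
    // fails with EOFException, i.e. the stream header could not be read.
    static boolean headerReadFails(byte[] data) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // data ended before the header was complete
        }
    }

    // Serializes an object with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Zero bytes, as an empty file would provide.
        System.out.println(headerReadFails(new byte[0]));
        // A well-formed serialized payload.
        System.out.println(headerReadFails(serialize("state")));
    }
}
```

The first call reports a failure and the second does not, so a zero-length file would be enough to produce the EOFException seen in the log. Whether that is what actually happened on the affected host is an assumption.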

Current workaround: fully stop the supervisor daemon and delete Storm's entire 
data/supervisor directory. After a restart, the supervisor runs smoothly 
again. 
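
The delete step of the workaround can be sketched as below, assuming the supervisor daemon has already been stopped. This is only an illustration: SupervisorStateReset is a hypothetical helper, not something shipped with Storm, and the directory to remove is passed in as an argument because its location depends on the local Storm configuration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Storm. Run only while the supervisor is stopped.
class SupervisorStateReset {

    // Deletes the directory and everything below it; does nothing if it is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // propagate traversal errors instead of hiding them
                }
                Files.delete(d); // directory is empty at this point
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SupervisorStateReset <supervisor-data-dir>");
            System.exit(2);
        }
        deleteRecursively(Paths.get(args[0]));
    }
}
```

The argument would be the data/supervisor directory mentioned above.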

{anchor:links} Here are some references to very similar issues:
* http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3c23100d14e7ac4cef947f7236ef896...@by2pr08mb144.namprd08.prod.outlook.com%3E
* https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
* https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8

Regards,




--
This message was sent by Atlassian JIRA
(v6.2#6252)
