Damien Raude-Morvan created STORM-307:
-----------------------------------------
Summary: After host crash, supervisor is unable to restart itself
Key: STORM-307
URL: https://issues.apache.org/jira/browse/STORM-307
Project: Apache Storm (Incubating)
Issue Type: Bug
Affects Versions: 0.9.1-incubating
Environment: Debian Linux Wheezy
Zookeeper 3.3.3
Java 1.7.0_25
Reporter: Damien Raude-Morvan
Hi,
I've observed [multiple times|#links] that supervisor state de-serialisation
after a host crash or reboot can fail. The supervisor is then unable to come up
without manual intervention. AFAICT, the serialized supervisor state
is invalid and can't be read at the next start.
Observed error in the supervisor log:
{noformat}
2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection,
connectString=127.0.0.1:2181/storm sessionTimeout=20000
watcher=com.netflix.curator.ConnectionState@18d055e0
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server
/127.0.0.1:2181
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to
localhost/127.0.0.1:2181, initiating session
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on
server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated
timeout = 20000
2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id
71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
at backtype.storm.utils.Utils.deserialize(Utils.java:86)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at backtype.storm.utils.LocalState.snapshot(LocalState.java:45)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at backtype.storm.utils.LocalState.get(LocalState.java:56)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at
backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
~[clojure-1.4.0.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39)
~[na:na]
at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: java.io.EOFException: null
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
~[na:1.7.0_25]
at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
~[na:1.7.0_25]
at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799)
~[na:1.7.0_25]
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
~[na:1.7.0_25]
at backtype.storm.utils.Utils.deserialize(Utils.java:81)
~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
... 11 common frames omitted
2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an
event")
{noformat}
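For what it's worth, the EOFException is raised inside the ObjectInputStream constructor (readStreamHeader), before any object is read, which is what Java serialization does when the input ends before the 4-byte stream header. That would fit a state file left empty or truncated by the crash, although I have not confirmed this on the affected hosts. A minimal sketch that reproduces the same exception outside Storm; the class and method names are hypothetical and this is not Storm's code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical demo class, not part of Storm.
class EmptyStateDemo {

    // Returns true if merely opening an ObjectInputStream over the given
    // bytes fails with EOFException, i.e. the stream header is missing.
    static boolean headerReadHitsEof(byte[] serialized) throws IOException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // input ended before the stream header was complete
        }
    }

    // Serializes an object with plain Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Zero bytes stand in for an empty state file (an assumption).
        System.out.println(headerReadHitsEof(new byte[0]));          // true
        System.out.println(headerReadHitsEof(serialize("state")));   // false
    }
}
```

If this is what happens, checking the size of the files under data/supervisor before restarting might show which ones are affected.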
Current workaround: fully stopping the supervisor daemon and deleting the
entire Storm data/supervisor directory helped; after a restart, the supervisor
is now running smoothly.
{anchor:links} Here are some references to very similar issues:
* http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3c23100d14e7ac4cef947f7236ef896...@by2pr08mb144.namprd08.prod.outlook.com%3E
* https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
* https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8
Regards,