[jira] [Commented] (STORM-307) After host crash, supervisor is unable to restart itself

ASF GitHub Bot (JIRA) Mon, 06 Oct 2014 02:51:07 -0700

    [ https://issues.apache.org/jira/browse/STORM-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160145#comment-14160145 ]


ASF GitHub Bot commented on STORM-307:
--------------------------------------

Github user wurstmeister commented on the pull request:

    https://github.com/apache/storm/pull/282#issuecomment-57995451
  
    Thanks @HeartSaVioR! It should be fixed now.
    I was wondering whether we actually need this retry logic, i.e. whether we
could just reset the state whenever there is an IOException (rather than
failing to start up and making users remove the state files manually).
    
    I am not sure why the retry logic was implemented this way, so I singled
out the empty-file case because it has been reported multiple times (including
the workaround of resetting the state).
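
    To make the empty-file case concrete, here is a minimal sketch of the idea.
It is not the code in the pull request and not Storm's LocalState; the class
LocalStateSketch and the methods loadOrReset and save are hypothetical names,
and it assumes the state is a plain Java-serialized map stored in one file. A
zero-byte file is treated as empty state; any other IOException still
propagates to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: illustrates the "reset on empty file" idea only.
final class LocalStateSketch {

    // Returns the stored map, or a fresh empty map if the file is missing or zero bytes long.
    @SuppressWarnings("unchecked")
    static Map<Object, Object> loadOrReset(Path stateFile) throws IOException {
        if (!Files.exists(stateFile) || Files.size(stateFile) == 0) {
            // Empty-file case: opening an ObjectInputStream on it would throw
            // EOFException while reading the stream header, so start from empty state.
            return new HashMap<>();
        }
        try (InputStream raw = Files.newInputStream(stateFile);
             ObjectInputStream in = new ObjectInputStream(raw)) {
            return (Map<Object, Object>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }

    // Writes the map with plain Java serialization (assumed format for this sketch).
    static void save(Path stateFile, Map<Object, Object> state) throws IOException {
        try (OutputStream raw = Files.newOutputStream(stateFile);
             ObjectOutputStream out = new ObjectOutputStream(raw)) {
            out.writeObject(new HashMap<>(state));
        }
    }
}
```

    Resetting on every IOException instead would mean wrapping the read in a
broader catch; the sketch keeps the narrower check so that other failures
(for example permission problems) stay visible.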



> After host crash, supervisor is unable to restart itself
> --------------------------------------------------------
>
>                 Key: STORM-307
>                 URL: https://issues.apache.org/jira/browse/STORM-307
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.1-incubating
>         Environment: Debian Linux Wheezy
> Zookeeper 3.3.3
> Java 1.7.0_25
>            Reporter: Damien Raude-Morvan
>         Attachments: supeof.tar.bz2
>
>
> Hi,
> I've observed [multiple times|#links] that supervisor state de-serialisation 
> after a host crash or reboot can fail. The supervisor is then unable to come 
> up without manual intervention. AFAICT, the serialized supervisor 
> state is invalid and can't be read at the next start.
> Observed error in the supervisor log:
> {noformat}
> 2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
> 2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, 
> connectString=127.0.0.1:2181/storm sessionTimeout=20000 
> watcher=com.netflix.curator.ConnectionState@18d055e0
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to 
> server /127.0.0.1:2181
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to 
> localhost/127.0.0.1:2181, initiating session
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on 
> server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated 
> timeout = 20000
> 2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 
> 71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
> 2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
> java.lang.RuntimeException: java.io.EOFException
>         at backtype.storm.utils.Utils.deserialize(Utils.java:86) 
> ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) 
> ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.utils.LocalState.get(LocalState.java:56) 
> ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at 
> backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) 
> ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at clojure.lang.AFn.applyToHelper(AFn.java:161) 
> ~[clojure-1.4.0.jar:na]
>         at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
>         at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
>         at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) 
> ~[clojure-1.4.0.jar:na]
>         at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
>         at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) 
> ~[na:na]
>         at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>         at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> Caused by: java.io.EOFException: null
>         at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>  ~[na:1.7.0_25]
>         at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
>  ~[na:1.7.0_25]
>         at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) 
> ~[na:1.7.0_25]
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) 
> ~[na:1.7.0_25]
>         at backtype.storm.utils.Utils.deserialize(Utils.java:81) 
> ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         ... 11 common frames omitted
> 2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing 
> an event")
> {noformat}
> Current workaround: fully stopping the supervisor daemon and deleting Storm's 
> entire data/supervisor directory helped; after a restart, the supervisor is 
> now running smoothly. 
> {anchor:links} Here are some references to very similar issues:
> * 
> http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3c23100d14e7ac4cef947f7236ef896...@by2pr08mb144.namprd08.prod.outlook.com%3E
> * https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
> * https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8
> Regards,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
