[jira] [Commented] (STORM-1977) Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes

ASF GitHub Bot (JIRA) Tue, 19 Jul 2016 08:26:30 -0700

    [ https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384355#comment-15384355 ]


ASF GitHub Bot commented on STORM-1977:
---------------------------------------

Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/1574
  
    @revans2 
    IMO this patch will still work if we address both of the expectations: 1) 
BlobStore should return all available keys when listing keys. 2) BlobStore 
should ensure that it pulls a blob from another replica when the blob is not 
available locally. If we address only 1), it will break this patch and we go 
back to the current behavior.
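
    To make the two expectations concrete, here is a minimal sketch in Java. It is not Storm's BlobStore API: every name (`ReplicaAwareBlobs`, `listKeys`, `readBlob`) is hypothetical, and each store is modeled as a plain in-memory map from key to bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not Storm's BlobStore API: a store is a key -> bytes map.
final class ReplicaAwareBlobs {

    // Expectation 1: listing returns every key held locally or by any replica.
    static Set<String> listKeys(Map<String, byte[]> local,
                                List<Map<String, byte[]>> replicas) {
        Set<String> keys = new TreeSet<>(local.keySet());
        for (Map<String, byte[]> replica : replicas) {
            keys.addAll(replica.keySet());
        }
        return keys;
    }

    // Expectation 2: a read that misses locally pulls the blob from a replica
    // and stores it locally before returning it.
    static Optional<byte[]> readBlob(String key,
                                     Map<String, byte[]> local,
                                     List<Map<String, byte[]>> replicas) {
        byte[] blob = local.get(key);
        if (blob != null) {
            return Optional.of(blob);
        }
        for (Map<String, byte[]> replica : replicas) {
            byte[] remote = replica.get(key);
            if (remote != null) {
                local.put(key, remote); // sync the missing blob into the local store
                return Optional.of(remote);
            }
        }
        return Optional.empty(); // no replica holds the blob
    }

    public static void main(String[] args) {
        Map<String, byte[]> local = new HashMap<>();
        Map<String, byte[]> other = new HashMap<>();
        other.put("production-topology-2-1468745167-stormcode.ser", new byte[] {1});
        List<Map<String, byte[]>> replicas = new ArrayList<>();
        replicas.add(other);

        // Every listed key must also be readable, which is why 1) needs 2).
        for (String key : listKeys(local, replicas)) {
            System.out.println(key + " readable: "
                    + readBlob(key, local, replicas).isPresent());
        }
    }
}
```

    With only `listKeys` changed, a listed key could still fail to read from the local store, which is the breakage described above.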
    
    Supposing both are achieved, we would still like Nimbus to give up 
leadership rather than crash when the replication count of one of the topology 
codes in the blobstore is zero. As we discussed, we don't want to crash Nimbus 
in this case: Nimbus is itself a replica of the blobstore, so a crash decreases 
the replication count of some of the blobs, and the next leader Nimbus is more 
likely to crash.
    
    But, as you also said, there should be few problems if we have enough 
Nimbuses and blob syncing is fast enough and happens often.
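
    As a rough illustration of "give up leadership rather than crash", the decision could look like the sketch below. It is an assumption for discussion, not code from the patch: `LeaderReplicationCheck`, `Action`, and `decide` are hypothetical names, and the replication counts are assumed to be available as a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Storm.
final class LeaderReplicationCheck {
    enum Action { KEEP_LEADERSHIP, GIVE_UP_LEADERSHIP }

    // replicationCounts maps a topology code key to the number of Nimbus
    // replicas that currently hold that blob (an assumed input).
    static Action decide(Map<String, Integer> replicationCounts) {
        for (int count : replicationCounts.values()) {
            if (count <= 0) {
                // A blob nobody holds cannot be served; step down instead of halting.
                return Action.GIVE_UP_LEADERSHIP;
            }
        }
        return Action.KEEP_LEADERSHIP;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("production-topology-2-1468745167-stormcode.ser", 0);
        counts.put("production-topology-2-1468745167-stormconf.ser", 1);
        System.out.println(decide(counts));
    }
}
```

    The point of returning a decision instead of throwing is that the process stays alive, so the blobs this Nimbus does hold remain available to the next leader.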


> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more 
> replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical
>
> While investigating STORM-1976, I found that there are cases where Nimbus 
> does not have topology codes. 
> Before BlobStore, only a Nimbus that had all topology codes could gain 
> leadership; otherwise it gave up leadership immediately. This logic was 
> removed when BlobStore was introduced.
> I don't know whether this was intended, but it lets a Nimbus that doesn't 
> have the replicated topology code gain leadership, and that Nimbus crashes 
> when getClusterInfo is requested.
> The easiest way to reproduce is:
> 1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it is a quick 
> workaround for STORM-1976), and patch the Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run a topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 on a different node
> 6. Nimbus 2 gains leadership 
> 7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
>         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
> ...
>         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
>         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
> ...
>         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
>         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
>         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
>         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
>         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
> ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
>         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
>         at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
>         at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
> ...
>         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>         at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
>         at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
>         at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
> ...
>         at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
>         at clojure.lang.RestFn.invoke(RestFn.java:410)
>         at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
>         at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
>         at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
>         at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
>         at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>         at clojure.lang.RestFn.invoke(RestFn.java:423)
>         at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
>         at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>         at clojure.lang.AFn.run(AFn.java:22)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
