[ https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382505#comment-15382505 ]
ASF GitHub Bot commented on STORM-1977:
---------------------------------------

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/1574

    @HeartSaVioR I am OK with this change, just as I am OK with the change for STORM-1976. I just don't think that this is the final solution for the local blobstore+nimbus, nor do I think that either is a blocker.

    The reality is that if I use HDFS as the backing for the blobstore, set it to have only a single replica, and then lose a datanode, nimbus will still crash in almost exactly the same way. Is this a bug in nimbus? Is it a bug in the blobstore or HDFS? I would say no. It is user error, because the user is trying to make something HA without configuring it properly. Then when an error happens we cannot recover.

    So the question is: if it does happen, is it better for nimbus to crash or is it better for nimbus to hang? Because all this change does is switch from one to the other. I can see advantages for hanging over crashing, so I am OK with this fix.

    Does the blobstore code have bugs in it, and things that we can change to make it work better? I would expect it to; it is software after all. I just want us to spend some time thinking about how we really want it to behave and fix it properly. If that proper fix comes after making a few patches to improve things, that is fine.


> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical
>
> While investigating STORM-1976, I found that there are cases where nimbus does not have the topology codes.
> Before BlobStore, only nimbuses that had all topology codes could gain leadership; otherwise they gave up leadership immediately. When BlobStore was introduced, this logic was removed.
> I don't know whether this was intended, but it allows a nimbus that doesn't have the replicated topology code to gain leadership, and that nimbus crashes when getClusterInfo is requested.
> The easiest way to reproduce is:
> 1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it's a quick workaround for STORM-1976), and patch the Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run a topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 on a different node
> 6. Nimbus 2 gains leadership
> 7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
>       at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>       at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
>       ...
>       at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
>       at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
>       ...
>       at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
>       at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
>       at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
>       at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
>       at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
>       ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
>       at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>       at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
>       at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
>       at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
>       ...
>       at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>       at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>       at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
>       at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
>       at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
>       ...
>       at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
>       at clojure.lang.RestFn.invoke(RestFn.java:410)
>       at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
>       at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
>       at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
>       at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
>       at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>       ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
>       at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>       at clojure.lang.RestFn.invoke(RestFn.java:423)
>       at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
>       at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
>       at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>       at clojure.lang.AFn.run(AFn.java:22)
>       at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
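
The issue description above says that, before BlobStore, a nimbus kept leadership only if it held all topology codes and otherwise gave leadership up immediately. A minimal Python sketch of that kind of check follows. It is not Storm's implementation: the function names are hypothetical, and the two key suffixes are simply the ones visible in the log above (a real cluster may store more blobs per topology).

```python
from typing import Iterable, List, Set

# Key suffixes taken from the log above. This list is an assumption made
# for illustration, not Storm's definitive set of per-topology blobs.
REQUIRED_SUFFIXES = ("-stormcode.ser", "-stormconf.ser")


def missing_blob_keys(active_topology_ids: Iterable[str],
                      local_keys: Set[str]) -> List[str]:
    """Return the required blob keys that are absent from the local store."""
    missing = []
    for topology_id in active_topology_ids:
        for suffix in REQUIRED_SUFFIXES:
            key = topology_id + suffix
            if key not in local_keys:
                missing.append(key)
    return missing


def should_keep_leadership(active_topology_ids: Iterable[str],
                           local_keys: Set[str]) -> bool:
    """Hypothetical rule: keep leadership only if nothing is missing."""
    return not missing_blob_keys(active_topology_ids, local_keys)
```

With the topology id from the log and an empty local store, `missing_blob_keys` reports both keys and `should_keep_leadership` returns False, which corresponds to the state Nimbus 2 was in when it gained leadership in the reproduction steps.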