[ https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381251#comment-15381251 ]
Jungtaek Lim commented on STORM-1977:
-------------------------------------

The leader Nimbus should give up leadership when it does not have the code for every topology, especially when the cluster is using the local BlobStore. This is a kind of regression introduced between the Nimbus H/A and BlobStore changes.

> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical
>
> While investigating STORM-1976, I found that there are cases where a Nimbus does not have the topology code.
> Before BlobStore, only a Nimbus that had all topology code could gain leadership; otherwise it gave up leadership immediately. This logic was removed when BlobStore was introduced.
> I don't know whether that was intended, but it allows a Nimbus that doesn't have the replicated topology code to gain leadership, and that Nimbus crashes when getClusterInfo is requested.
> The easiest way to reproduce:
> 1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it's needed to resolve STORM-1976), and patch the Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run a topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 on a different node
> 6. Nimbus 2 gains leadership
> 7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
>     at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>     at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
> ...
>     at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
>     at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
> ...
>     at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
>     at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
>     at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
>     at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
>     at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
> ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
>     at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>     at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
>     at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
>     at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
> ...
>     at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>     at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>     at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
>     at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
>     at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
> ...
>     at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
>     at clojure.lang.RestFn.invoke(RestFn.java:410)
>     at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
>     at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
>     at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
>     at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
>     at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
>     at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>     at clojure.lang.RestFn.invoke(RestFn.java:423)
>     at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
>     at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
>     at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>     at clojure.lang.AFn.run(AFn.java:22)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
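
The leadership rule described in the issue (a Nimbus keeps leadership only if it holds the code for every active topology) can be modeled roughly as below. This is a minimal Python sketch for illustration only, not Storm code: the function names are hypothetical, and the two key suffixes are simply the ones that appear in the log of this issue. It assumes the set of local blob keys and the list of active topology ids are already known.

```python
from typing import Iterable, Set

# Key suffixes seen in the log of this issue; assumed here to be the
# blobs a Nimbus must hold locally for each topology.
REQUIRED_SUFFIXES = ("-stormcode.ser", "-stormconf.ser")


def required_blob_keys(topology_id: str) -> Set[str]:
    """Return the blob keys assumed to be required for one topology."""
    return {topology_id + suffix for suffix in REQUIRED_SUFFIXES}


def missing_blob_keys(active_topology_ids: Iterable[str],
                      local_blob_keys: Iterable[str]) -> Set[str]:
    """Return every required key that is absent from the local blob store."""
    local = set(local_blob_keys)
    missing: Set[str] = set()
    for topology_id in active_topology_ids:
        missing |= required_blob_keys(topology_id) - local
    return missing


def should_keep_leadership(active_topology_ids: Iterable[str],
                           local_blob_keys: Iterable[str]) -> bool:
    """A Nimbus keeps leadership only if no required key is missing."""
    return not missing_blob_keys(active_topology_ids, local_blob_keys)
```

In the reproduction steps above, Nimbus 2 starts with an empty local blob store while one topology is active, so under this model `should_keep_leadership` returns `False` and Nimbus 2 would give up leadership instead of serving getClusterInfo.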