[ 
https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated STORM-1977:
--------------------------------
    Description: 
While investigating STORM-1976, I found that there are cases where a nimbus 
does not have the topology code. 
Before BlobStore, only a nimbus that had all of the topology code could gain 
leadership; otherwise it gave up leadership immediately. This logic was removed 
when BlobStore was introduced.
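
The pre-BlobStore rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Storm's actual implementation: the class LeadershipCheck and both of its methods are hypothetical names, and topology ids are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LeadershipCheck {
    /** Returns the active topology ids whose code is not present locally (hypothetical helper). */
    public static Set<String> missingTopologies(Set<String> activeIds, Set<String> localIds) {
        Set<String> missing = new HashSet<>(activeIds);
        missing.removeAll(localIds);
        return missing;
    }

    /** A candidate keeps leadership only when no active topology code is missing. */
    public static boolean shouldKeepLeadership(Set<String> activeIds, Set<String> localIds) {
        return missingTopologies(activeIds, localIds).isEmpty();
    }

    public static void main(String[] args) {
        Set<String> active = new HashSet<>(Arrays.asList("production-topology-2-1468745167"));
        Set<String> local = new HashSet<>(); // freshly started nimbus with no replicated code
        System.out.println(shouldKeepLeadership(active, local)); // prints false
    }
}
```

With a check like this, the nimbus in the scenario below would step down instead of serving requests without the topology code.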

I don't know whether this is intended, but it allows a nimbus that doesn't 
have the replicated topology code to gain leadership, and that nimbus crashes 
when getClusterInfo is requested.
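
The failure mode can be illustrated with a small sketch. Everything here is hypothetical and is not Storm's API: BlobLookupSketch is an in-memory stand-in for a blob store, MissingBlobException stands in for a "key not found" error, and replicationOrZero shows one possible tolerant lookup, not the fix chosen for this issue.

```java
import java.util.HashMap;
import java.util.Map;

public class BlobLookupSketch {
    /** Hypothetical stand-in for a "key not found" error; not Storm's exception class. */
    public static class MissingBlobException extends Exception {
        public MissingBlobException(String key) { super(key); }
    }

    // Hypothetical in-memory store: blob key -> replication count.
    private final Map<String, Integer> replication = new HashMap<>();

    public void put(String key, int count) { replication.put(key, count); }

    /** Strict lookup: throws when the key is absent, so the error reaches the caller. */
    public int getReplication(String key) throws MissingBlobException {
        Integer count = replication.get(key);
        if (count == null) throw new MissingBlobException(key);
        return count;
    }

    /** Tolerant lookup: reports 0 replicas for an absent key instead of throwing. */
    public int replicationOrZero(String key) {
        try {
            return getReplication(key);
        } catch (MissingBlobException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        BlobLookupSketch store = new BlobLookupSketch();
        String key = "production-topology-2-1468745167-stormcode.ser";
        System.out.println(store.replicationOrZero(key)); // prints 0
        try {
            store.getReplication(key);
        } catch (MissingBlobException e) {
            System.out.println("missing: " + e.getMessage());
        }
    }
}
```

The strict variant mirrors what the log below shows: a lookup for a key the new leader never received fails, and the error propagates to the caller.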

The easiest way to reproduce is:

1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it's a quick 
workaround for resolving STORM-1976), and patch the Storm cluster
2. Launch Nimbus 1 (leader)
3. Run a topology
4. Kill Nimbus 1
5. Launch Nimbus 2 on a different node
6. Nimbus 2 gains leadership
7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes

Log:

{code}
2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
...
2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
...
2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
        at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
        at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
...
        at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
...
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
...
2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
        at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
        at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
        at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
        at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
...
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
        at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
        at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
        at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
...
        at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
        at clojure.lang.RestFn.invoke(RestFn.java:410)
        at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
        at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
        at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
        at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
        at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
...
2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
java.lang.RuntimeException: ("Error when processing an event")
        at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
        at clojure.lang.RestFn.invoke(RestFn.java:423)
        at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
        at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
        at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)
2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
{code}


  was:
While investigating STORM-1976, I found that there are cases where a nimbus 
does not have the topology code. 
Before BlobStore, only a nimbus that had all of the topology code could gain 
leadership; otherwise it gave up leadership immediately. This logic was removed 
when BlobStore was introduced.

I don't know whether this is intended, but it allows a nimbus that doesn't 
have the replicated topology code to gain leadership, and that nimbus crashes 
when getClusterInfo is requested.

The easiest way to reproduce is:

1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it's needed to 
resolve STORM-1976), and patch the Storm cluster
2. Launch Nimbus 1 (leader)
3. Run a topology
4. Kill Nimbus 1
5. Launch Nimbus 2 on a different node
6. Nimbus 2 gains leadership
7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes




> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more 
> replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
