[
https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383353#comment-15383353
]
ASF GitHub Bot commented on STORM-1977:
---------------------------------------
Github user HeartSaVioR commented on the issue:
https://github.com/apache/storm/pull/1574
@revans2
I'm just thinking about the responsibility of Nimbus.
- Nimbus was a "soft" SPOF, and we claimed that because Nimbus is designed to
be fail-fast and stateless, just supervising Nimbus works like a charm. But
that doesn't help with machine failure, and moving Nimbus to another machine
requires at least a configuration change across the whole cluster. (This
assumes that Supervisor is also supervised by an auxiliary process. If not,
Supervisor has to be started manually.)
- Nimbus H/A came in. It was relatively easier than H/A for processes in other
projects, since Nimbus is designed to be stateless and has no state to sync.
The only thing Nimbuses need to sync is topology code, and Nimbus H/A
addressed this with full replication and a restriction on which Nimbus can
become leader. Replicating topology code to every Nimbus added some overhead,
but it was a best effort at higher availability. When the other Nimbuses
crashed and only the 'leader' Nimbus was alive, that was completely OK for the
moment. There was a chance that no live Nimbus had the complete set of
topology code, leaving no leader and a hung cluster, but that chance was
relatively smaller than with an approach based on replication count, since
every Nimbus held a full replica.
- BlobStore came in. I don't know the details of BlobStore, so it's hard for
me to tell. I'd be happy if you could fill in this part: after BlobStore.
One thing I'm concerned about is that there's a new requirement for Nimbus
not to crash easily, since every Nimbus is now also a BlobStore replica, like
a DataNode. But Nimbus itself has a lot of work to do (certainly the leader,
and I'm not sure about followers), and it is still built on fail-fast. Is it
OK for these two to work together?
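
To make the Nimbus H/A "restriction of becoming leader" concrete, here is a
minimal sketch of that kind of eligibility check. It is not Storm's code:
LeadershipSketch, missingKeys and mayKeepLeadership are hypothetical names,
and the blob keys are made up for illustration. It only assumes that a Nimbus
can list the topology code keys that should exist and the keys it holds
locally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: these names are illustrative and are not Storm APIs.
final class LeadershipSketch {

    // Keys of active topologies that the local store does not hold.
    static Set<String> missingKeys(Set<String> activeKeys, Set<String> localKeys) {
        Set<String> missing = new TreeSet<>(activeKeys);
        missing.removeAll(localKeys);
        return missing;
    }

    // Rule as described in the comment above: a Nimbus may keep leadership
    // only when every active topology's code is available locally.
    static boolean mayKeepLeadership(Set<String> activeKeys, Set<String> localKeys) {
        return missingKeys(activeKeys, localKeys).isEmpty();
    }

    public static void main(String[] args) {
        // Made-up keys: this Nimbus holds the conf blob but not the code blob.
        Set<String> active = new HashSet<>(
                Arrays.asList("topo-1-stormcode.ser", "topo-1-stormconf.ser"));
        Set<String> local = new HashSet<>(Arrays.asList("topo-1-stormconf.ser"));

        System.out.println("missing: " + missingKeys(active, local));
        System.out.println("may keep leadership: " + mayKeepLeadership(active, local));
    }
}
```

With a check like this, a Nimbus that is missing any topology code would give
up leadership instead of serving requests it cannot answer. The trade-off is
the one noted above: if no live Nimbus has the full set, there is no leader.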
> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more
> replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
> Key: STORM-1977
> URL: https://issues.apache.org/jira/browse/STORM-1977
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.0, 1.0.1
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Critical
>
> While investigating STORM-1976, I found that there are cases where a nimbus
> does not have topology code.
> Before BlobStore, only nimbuses that had all topology code could gain
> leadership; otherwise they gave up leadership immediately. When BlobStore
> was introduced, this logic was removed.
> I don't know whether that was intended, but it allows a nimbus that doesn't
> have the replicated topology code to gain leadership, and that nimbus
> crashes when getClusterInfo is requested.
> The easiest way to reproduce:
> 1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it's a quick
> workaround for STORM-1976), and patch the Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run a topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 on a different node
> 6. Nimbus 2 gains leadership
> 7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
> at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
> at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
> ...
> at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
> at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
> ...
> at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
> at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
> at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
> at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
> at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
> ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
> at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
> at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
> at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
> at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
> ...
> at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
> at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
> at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
> at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
> at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
> ...
> at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
> at clojure.lang.RestFn.invoke(RestFn.java:410)
> at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
> at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
> at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
> at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
> at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
> at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
> at clojure.lang.RestFn.invoke(RestFn.java:423)
> at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
> at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
> at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> at clojure.lang.AFn.run(AFn.java:22)
> at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}
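
The first trace in the log shows a KeyNotFoundException from the
replication-count lookup escaping out of getClusterInfo. One possible
mitigation, sketched here under assumptions and not taken from any Storm
patch, is to report a replication count of 0 for a blob the local store does
not know about. ReplicationCountSketch, MissingBlobException, replicationOf
and replicationOrZero are hypothetical names, and the Map is only a stand-in
for a blob store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Storm APIs.
final class ReplicationCountSketch {

    // Stand-in for the KeyNotFoundException seen in the log.
    static final class MissingBlobException extends Exception {
        MissingBlobException(String key) {
            super(key);
        }
    }

    // Stand-in for a blob store lookup that throws on an unknown key.
    static int replicationOf(Map<String, Integer> store, String key)
            throws MissingBlobException {
        Integer count = store.get(key);
        if (count == null) {
            throw new MissingBlobException(key);
        }
        return count;
    }

    // Reports 0 for a missing blob instead of letting the exception escape
    // to the caller that is assembling cluster information.
    static int replicationOrZero(Map<String, Integer> store, String key) {
        try {
            return replicationOf(store, key);
        } catch (MissingBlobException e) {
            return 0;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>();
        store.put("topo-1-stormconf.ser", 2); // made-up key and count

        System.out.println(replicationOrZero(store, "topo-1-stormconf.ser"));
        System.out.println(replicationOrZero(store, "topo-1-stormcode.ser"));
    }
}
```

This only keeps the information request from failing. It does not address the
second trace, where mk_assignments cannot read the topology conf at all, so a
leader that lacks topology code would still need to either fetch it or give
up leadership.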