[jira] [Commented] (STORM-1977) Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes

ASF GitHub Bot (JIRA) Tue, 19 Jul 2016 06:50:07 -0700

    [ 
https://issues.apache.org/jira/browse/STORM-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384166#comment-15384166
 ]


ASF GitHub Bot commented on STORM-1977:
---------------------------------------

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/1574
  
    @HeartSaVioR the local file system blobstore behaves exactly the same as 
the nimbus HA before it.  It tries to replicate all of the blobs to all of the 
nimbus nodes as quickly as possible.  It stores the metadata for each blob in 
ZK, just like the original nimbus HA code.  The big difference is that it can 
store a general set of blobs and they are all versioned so you can upload a new 
one, and if you ask for a blob that is not currently local on that nimbus it 
will speed up they sync process and go grab it from another nimbus instance for 
you.
    
    There is still a race, like with nimbus HA where it may not have fully 
replicated, but if you set the replication count when the blob is uploaded it 
should be waiting for the replication to complete before declaring success.
    
    The one bug I saw while going through the code is that when we list keys, 
we are doing it only from the local storage, not form ZK.  If we were doing it 
from ZK then when someone asks for all of the keys they would be guaranteed to 
get all of the keys, but this patch would do no good. Simply because it would 
not have an API to know what is local and what is not local.  In the short term 
I think this is OK, but long term we need to discuss how we really want all of 
this to work in the different failure cases.
    
    When configured to use HDFS as the backing, all of the "blobs" are stored 
in HDFS.  We do a directory listing to get the key listing.  With that nimbus 
truly becomes stateless, and you can stand up a nimbus on a different node and 
not have to worry about it.  This code would simply do the directory listing 
and then become a noop.


> Leader Nimbus crashes with getClusterInfo when it doesn't have one or more 
> replicated topology codes
> ----------------------------------------------------------------------------------------------------
>
>                 Key: STORM-1977
>                 URL: https://issues.apache.org/jira/browse/STORM-1977
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Critical
>
> While investigating STORM-1976, I found that there're cases for nimbus to not 
> having topology codes. 
> Before BlobStore, only nimbuses which is having all topology codes can gain 
> leadership, otherwise they give up leadership immediately. While introducing 
> BlobStore, this logic is removed.
> I don't know it's intended or not, but it incurs one of nimbus to gain 
> leadership which doesn't have replicated topology code, and the nimbus will 
> be crashed when getClusterInfo is requested.
> Easiest way to reproduce is:
> 1. comment cleanup-corrupt-topologies! from nimbus.clj (It's a quick 
> workaround for resolving STORM-1976), and patch Storm cluster
> 2. Launch Nimbus 1 (leader)
> 3. Run topology
> 4. Kill Nimbus 1
> 5. Launch Nimbus 2 from different node
> 6. Nimbus 2 gains leadership 
> 7. getClusterInfo is requested to Nimbus 2, and Nimbus 2 gets crashed
> Log:
> {code}
> 2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob 
> store based in /grid/0/hadoop/storm/blobs
> ...
> 2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
> 2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
> ...
> 2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for 
> storm version '1.1.1-SNAPSHOT'
> 2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error 
> processing getClusterInfo
> KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
> ...
>         at 
> org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
>         at 
> org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
> ...
>         at 
> org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
>         at 
> org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
>         at 
> org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
>         at 
> org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
>         at 
> org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
> ...
> 2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download 
> blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyproduction-topology-2-1468745167-stormconf.ser
> 2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
> KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
>         at 
> org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
>         at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
>         at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
> ...
>        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
>         at 
> org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
>         at 
> org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
>         at 
> org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
> ...
>         at 
> org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
>         at clojure.lang.RestFn.invoke(RestFn.java:410)
>         at 
> org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
>         at 
> org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
>         at 
> org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
>         at 
> org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
> ...
> 2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when 
> processing an event")
> java.lang.RuntimeException: ("Error when processing an event")
>         at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>         at clojure.lang.RestFn.invoke(RestFn.java:423)
>         at 
> org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
>         at 
> org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
>         at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
>         at clojure.lang.AFn.run(AFn.java:22)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (STORM-1977) Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes

Reply via email to