Hello Jungtaek,
I confirm that we currently do not have multiple Nimbus nodes.
I want to clarify that the Nimbus process never crashed: it keeps printing
the following error in its log:
2017-08-06 03:44:01.777 o.a.s.t.ProcessFunction pool-14-thread-1 [ERROR] Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
... (rest of the stack trace)
However, I forgot to mention that:
* During this problem our topologies are unaffected.
* We notice the problem because we have a self-healing check on the Nimbus UI
process, which consists of calling its REST web services to get a few stats on
running topologies. When the Nimbus UI is unresponsive, we kill it and restart
it automatically (a minimal sketch of this kind of check follows the log
excerpt below).
* The period of our Nimbus UI self-healing cron is 1 minute.
* While the getClusterInfo stack trace keeps appearing in the Nimbus log, the
Nimbus UI is unable to restart.
* Here is the kind of exception we see in ui.log (the Nimbus UI's log):
2017-08-24 21:24:21.777 o.a.s.u.core main [INFO] Starting ui server for storm version '1.1.0'
2017-08-24 21:24:21.788 o.a.s.d.m.MetricsUtils main [INFO] Using statistics reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
2017-08-24 21:24:21.790 o.a.s.d.m.r.JmxPreparableReporter main [INFO] Preparing...
2017-08-24 21:24:21.801 o.a.s.d.common main [INFO] Started statistics report plugin...
2017-08-24 21:24:21.878 o.a.s.s.o.e.j.s.Server main [INFO] jetty-7.x.y-SNAPSHOT
2017-08-24 21:24:21.929 o.a.s.s.o.e.j.s.h.ContextHandler main [INFO] started o.a.s.s.o.e.j.s.ServletContextHandler{/,null}
2017-08-24 21:24:21.965 o.a.s.s.o.e.j.s.AbstractConnector main [INFO] Started [email protected]:8070
2017-08-24 21:25:02.446 o.a.s.u.NimbusClient qtp2142536057-19 [WARN] Ignoring exception while trying to get leader nimbus info from ec2-52-48-59-151.eu-west-1.compute.amazonaws.com. will retry with a different seed host.
org.apache.storm.thrift.transport.TTransportException: null
        at org.apache.storm.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.transport.TTransport.readAll(TTransport.java:86) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Client.recv_getLeader(Nimbus.java:1193) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Client.getLeader(Nimbus.java:1181) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:84) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.ui.core$cluster_summary.invoke(core.clj:355) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.ui.core$fn__9556.invoke(core.clj:1113) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.shade.compojure.core$make_route$fn__5976.invoke(core.clj:100) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.shade.compojure.core$if_route$fn__5964.invoke(core.clj:46) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.shade.compojure.core$if_method$fn__5957.invoke(core.clj:31) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.shade.compojure.core$routing$fn__5982.invoke(core.clj:113) ~[storm-core-1.1.0.jar:1.1.0]
        at clojure.core$some.invoke(core.clj:2570) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.shade.compojure.core$routing.doInvoke(core.clj:113) ~[storm-core-1.1.0.jar:1.1.0]
        at clojure.lang.RestFn.applyTo(RestFn.java:139) ~[clojure-1.7.0.jar:?]
        at clojure.core$apply.invoke(core.clj:632) ~[clojure-1.7.0.jar:?]
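
For clarity, our self-healing check on the Nimbus UI is roughly equivalent to
the following sketch (illustrative only: the real check is a cron-driven
script; host, port and timeouts here are examples, and /api/v1/cluster/summary
is the UI REST endpoint we poll for stats). A non-zero exit code makes the
1-minute cron job kill and restart the Nimbus UI process:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class NimbusUiHealthCheck {
        public static void main(String[] args) {
            try {
                // Our UI listens on 8070 (see "Started ...:8070" above).
                URL url = new URL("http://localhost:8070/api/v1/cluster/summary");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(5000);   // fail fast if the UI hangs
                conn.setReadTimeout(10000);
                int status = conn.getResponseCode();
                conn.disconnect();
                // Anything but HTTP 200 is treated as "unresponsive".
                System.exit(status == 200 ? 0 : 1);
            } catch (Exception e) {
                System.exit(1);
            }
        }
    }
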
So our problem is that Nimbus stays forever in an inconsistent state without
crashing. Even if it did crash, that would not help: Nimbus is restarted
automatically (we also have self-healing on Nimbus), but in such a case we
still need to manually clean up the content in the ZooKeepers before Nimbus
can restart in a clean state.
My question is the following: is there a way to make Nimbus more resilient to
such issues?
Best regards,
Alexandre
2017-08-24 23:06 GMT+02:00 Jungtaek Lim <[email protected]>:
> Blob files (meta, data) are in the storm local directory. ZK only has the
> list of blob keys and which alive nimbuses have each file. So if you lose
> the storm local directory, you just can't restore the blobs, unless other
> nimbuses have them so the current nimbus can pull them.
> (I guess you have only one nimbus, since the replication factor was 1 in the
> nimbus log.)
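>
> (For reference, a nimbus HA setup in Storm 1.x just lists the candidate
> nimbus hosts under nimbus.seeds in storm.yaml, for example:
>     nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]
> The hostnames above are placeholders. With several nimbuses running, blobs
> get replicated between them, so losing one local directory is recoverable.)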
>
> If a FileNotFoundException (FNF) occurs while looking up the meta file, we
> just throw KeyNotFoundException, because it is the same as the key not
> existing.
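>
> Roughly, the meta lookup behaves like the sketch below (illustrative only,
> not the actual storm-core code; the classes here are made up to show the
> idea):
>
>     import java.io.File;
>
>     class KeyNotFoundException extends Exception {
>         KeyNotFoundException(String key) { super(key); }
>     }
>
>     class BlobMetaLookup {
>         private final File localBlobDir;
>
>         BlobMetaLookup(File localBlobDir) {
>             this.localBlobDir = localBlobDir;
>         }
>
>         // If the meta file for a key is gone from the local dir (e.g. /tmp
>         // was cleaned), the key is treated exactly as if it never existed,
>         // even though ZK still lists it.
>         File getStoredBlobMeta(String key) throws KeyNotFoundException {
>             File meta = new File(localBlobDir, key + ".meta");
>             if (!meta.exists()) {
>                 throw new KeyNotFoundException(key);
>             }
>             return meta;
>         }
>     }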
>
> Anyway, you have to clean up everything, because there is an inconsistency
> between ZK and the local directory, and there is no way to recover the
> missing blob files, hence no way to make things consistent again.
>
> Btw, in this case, my expectation is that after crashing and restarting,
> nimbus can't become the leader and just waits for the missing blobs. If you
> saw Nimbus crash again while relaunching, it could be a bug.
>
> - Jungtaek Lim (HeartSaVioR)
> On Fri, 25 Aug 2017 at 3:35 AM Alexandre Vermeerbergen <
> [email protected]> wrote:
>
> > Hello Jungtaek,
> >
> > I can do what you suggest (i.e. moving the storm local dir to a place which
> > isn't in /tmp), but since the issue occurs rarely (once per month), I doubt
> > I'll be able to give feedback soon.
> >
> > What is puzzling to me is that in order to recover from such an issue, we
> > have to stop everything, then clean the Zookeepers, then restart all Storm
> > processes and finally restart our topologies. We don't clean the /tmp
> > directory.
> >
> > Are you sure there is no relationship between the Zookeeper contents and
> > what is causing the following message?
> > "Internal error processing getClusterInfo"
> > Best regards,
> > Alexandre
> >
> > 2017-08-24 13:41 GMT+02:00 Jungtaek Lim <[email protected]>:
> >
> > > Sorry I meant reproduce, not replicate. :)
> > >
> > > On Thu, Aug 24, 2017 at 8:34 PM, Jungtaek Lim <[email protected]> wrote:
> > >
> > > > Alexandre,
> > > >
> > > > I found that your storm local dir is set to "/tmp/storm", parts or all
> > > > of which could be removed at any time.
> > > > Could you move the path to a non-temporary place and try to replicate?
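> > > >
> > > > (For example, assuming a persistent directory such as /var/storm exists
> > > > and is writable by the user running Storm, storm.yaml would contain:
> > > >     storm.local.dir: "/var/storm"
> > > > The exact path is up to you; the point is that it must survive any /tmp
> > > > cleanup.)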
> > > >
> > > > Thanks,
> > > > Jungtaek Lim (HeartSaVioR)
> > > >
> > > >
> > > > On Thu, Aug 24, 2017 at 6:40 PM, Alexandre Vermeerbergen <[email protected]> wrote:
> > > >
> > > >> Hello Jungtaek,
> > > >>
> > > >> Thank you very much for your answer.
> > > >>
> > > >> Please find attached the full Nimbus log (gzipped) related to this issue.
> > > >>
> > > >> Please note that the last ERROR repeats forever until we "repair" Storm.
> > > >>
> > > >> From the logs, it could be that the issue began close to when a topology
> > > >> was restarted (killed, then started).
> > > >>
> > > >> Maybe this caused a corruption in Zookeeper. Is there anything I can
> > > >> collect from our Zookeeper nodes/logs to help the analysis?
> > > >>
> > > >> Best regards,
> > > >> Alexandre
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <[email protected]>:
> > > >>
> > > >>> Hi Alexandre, I missed this mail since I was on vacation.
> > > >>>
> > > >>> I followed the stack trace, but it is hard to analyze without context.
> > > >>> Do you mind providing the full nimbus log?
> > > >>>
> > > >>> Thanks,
> > > >>> Jungtaek Lim (HeartSaVioR)
> > > >>>
> > > >>> On Wed, Aug 16, 2017 at 4:12 AM, Alexandre Vermeerbergen <[email protected]> wrote:
> > > >>>
> > > >>> > Hello,
> > > >>> >
> > > >>> > Tomorrow I will have to restart the cluster on which I have this
> > > >>> > issue with Storm 1.1.0.
> > > >>> > Is anybody interested in my running some commands to get more logs
> > > >>> > before I repair this cluster?
> > > >>> >
> > > >>> > Best regards,
> > > >>> > Alexandre Vermeerbergen
> > > >>> >
> > > >>> > 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <
> > > >>> > [email protected]
> > > >>> > >:
> > > >>> >
> > > >>> > > Hello,
> > > >>> > >
> > > >>> > > I think it might be of interest for you Storm developers to learn
> > > >>> > > that I currently have a case of an issue with Storm 1.1.0 which was
> > > >>> > > supposed to be resolved in this release according to
> > > >>> > > https://issues.apache.org/jira/browse/STORM-1977 ; I can look for
> > > >>> > > any more information you'd need to diagnose why this issue can
> > > >>> > > still happen.
> > > >>> > >
> > > >>> > > Indeed, I have a Storm UI process which can't get any information
> > > >>> > > on its Storm cluster, and I see many occurrences of the following
> > > >>> > > exception in nimbus.log:
> > > >>> > >
> > > >>> > > 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
> > > >>> > > 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
> > > >>> > > 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> > > >>> > > 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
> > > >>> > > org.apache.storm.generated.KeyNotFoundException: null
> > > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
> > > >>> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
> > > >>> > >         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
> > > >>> > >         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
> > > >>> > >         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
> > > >>> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
> > > >>> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
> > > >>> > >         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
> > > >>> > >
> > > >>> > >
> > > >>> > > What is amazing is that I got this same issue on two Storm clusters
> > > >>> > > running on different VMs; they just share the same data in their
> > > >>> > > Kafka Broker cluster (one cluster is the production one, which was
> > > >>> > > quickly fixed, and the other one is the "backup" cluster to be used
> > > >>> > > for a quick "back to production" if the production one fails).
> > > >>> > >
> > > >>> > > I left one of these clusters with this behavior because I felt it
> > > >>> > > could be interesting for Storm developers to have more information
> > > >>> > > on this issue, if needed to properly diagnose it.
> > > >>> > >
> > > >>> > > I can keep this cluster as is for 2 days at most.
> > > >>> > >
> > > >>> > > Is there anything useful I can collect on it to help Storm
> > > >>> > > developers understand the cause (and hopefully use it to make Storm
> > > >>> > > more robust)?
> > > >>> > >
> > > >>> > > A few details:
> > > >>> > >
> > > >>> > > * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM,
> > > >>> > > plus 4 Supervisor VMs and 3 Zookeeper VMs
> > > >>> > >
> > > >>> > > * Running with Java Server JRE 1.8.0_121
> > > >>> > > * Running on AWS EC2 instances
> > > >>> > >
> > > >>> > > * We run about 10 topologies, with automatic self-healing on them
> > > >>> > > (if they stop consuming Kafka items, our self-healer calls "kill
> > > >>> > > topology" and then eventually restarts the topology)
> > > >>> > >
> > > >>> > > * We have self-healing on the Nimbus UI based on calling its REST
> > > >>> > > services. If it is not responding fast enough, we restart the
> > > >>> > > Nimbus UI
> > > >>> > > * We figured out the issue because the Nimbus UI was being
> > > >>> > > restarted every 2 minutes
> > > >>> > >
> > > >>> > > * To fix our production server, which had the same symptom, we had
> > > >>> > > to stop all Storm processes, then stop all Zookeepers, then remove
> > > >>> > > all data in the Zookeeper "snapshot files", then restart all
> > > >>> > > Zookeepers, then restart all Storm processes, then re-submit all
> > > >>> > > our topologies (an alternative, more targeted cleanup is sketched
> > > >>> > > at the end of this message)
> > > >>> > >
> > > >>> > > Please be as clear as possible about which commands we should run
> > > >>> > > to give you more details if needed.
> > > >>> > >
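> > > >>> > > As an alternative to wiping the Zookeeper snapshot files, a more
> > > >>> > > targeted cleanup would be to delete only the Storm znodes once
> > > >>> > > every Storm daemon and topology is stopped. This is a minimal
> > > >>> > > sketch with Curator, assuming the default storm.zookeeper.root of
> > > >>> > > "/storm" and example ZooKeeper hostnames (not what we actually did):
> > > >>> > >
> > > >>> > >     import org.apache.curator.framework.CuratorFramework;
> > > >>> > >     import org.apache.curator.framework.CuratorFrameworkFactory;
> > > >>> > >     import org.apache.curator.retry.ExponentialBackoffRetry;
> > > >>> > >
> > > >>> > >     public class StormZkCleanup {
> > > >>> > >         public static void main(String[] args) throws Exception {
> > > >>> > >             // Connect to the ZooKeeper ensemble used by Storm.
> > > >>> > >             CuratorFramework zk = CuratorFrameworkFactory.newClient(
> > > >>> > >                     "zk1:2181,zk2:2181,zk3:2181",
> > > >>> > >                     new ExponentialBackoffRetry(1000, 3));
> > > >>> > >             zk.start();
> > > >>> > >             // Remove the Storm root znode and everything under it;
> > > >>> > >             // adjust "/storm" if storm.zookeeper.root is overridden.
> > > >>> > >             if (zk.checkExists().forPath("/storm") != null) {
> > > >>> > >                 zk.delete().deletingChildrenIfNeeded().forPath("/storm");
> > > >>> > >             }
> > > >>> > >             zk.close();
> > > >>> > >         }
> > > >>> > >     }
> > > >>> > >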
> > > >>> > > Best regards,
> > > >>> > > Alexandre Vermeerbergen
> > > >>> > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > >
> >
>