Are you using acking and/or do you have backpressure enabled? Your worker
crashed because it exceeded the GC overhead limit, which by default in Java
means it was spending more than 98% of its time doing GC and only 2% of the
time doing real work. I am rather surprised that the supervisor didn't shoot
your worker and relaunch it, because a worker in that state typically has
trouble heartbeating to the supervisor before it gets that far. Enabling
acking with max spout pending set, or turning on backpressure, should make
your topology much less likely to die. You may also need to tune how much
memory you are giving your worker. Also, I would really like to see the stack
traces on the exceptions you are seeing. They could be caused by high GC
overhead, or they could indicate that some other system you are talking to,
probably Solr in this case, is in trouble and is closing connections
unexpectedly.
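Roughly, the knobs I mean look like this (a minimal sketch; the pending count
and heap size are placeholder values you would tune for your own workload):

```java
import org.apache.storm.Config;

Config conf = new Config();

// Cap the number of un-acked tuples in flight per spout task, so the
// spout stops emitting before pending/replayed tuples fill the worker
// heap. Only has an effect when acking is enabled.
conf.setMaxSpoutPending(500);

// Or turn on automatic backpressure (available since Storm 1.0):
// bolts with congested receive queues throttle the spouts upstream.
conf.put(Config.TOPOLOGY_BACKPRESSURE_ENABLE, true);

// Give the worker JVM more headroom; -Xmx2g is just an example value.
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");
```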
- Bobby
On Thursday, December 15, 2016 8:15 PM, S G <[email protected]> wrote:
Hi,
I am using Storm 1.0.2.
My configuration is quite simple: a `kafka-spout` feeding into a `solr-bolt`.
topology.workers = 2
spout.parallelism = 1
bolt.parallelism = 1
Our messages coming from Kafka are large: around 100 KB per message, up to a
maximum of 500 KB per message.
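The wiring, sketched in code (the ZK connect string, topic name, and the Solr
bolt class below are placeholders for our real values):

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;

// Placeholder ZK connect string, topic, offset root, and spout id.
SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts("zk1:2181"), "my-topic", "/kafka-offsets", "kafka-spout");

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);  // spout.parallelism = 1
builder.setBolt("solr-bolt", new SolrIndexingBolt(), 1)           // bolt.parallelism = 1; placeholder bolt class
       .shuffleGrouping("kafka-spout");

Config conf = new Config();
conf.setNumWorkers(2);  // topology.workers = 2
```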
But I see lots of errors:
```
Window    Emitted    Transferred    Complete latency (ms)    Acked     Failed
10m 0s    355,160    355,161        15,263                    29,040    340,823
```
And after running for 30 minutes, the kafka-spout dies with an OutOfMemoryError:
```
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.storm.kafka.PartitionManager.fail(PartitionManager.java:281)
    at org.apache.storm.kafka.KafkaSpout.fail(KafkaSpout.java:173)
    at org.apache.storm.daemon.executor$fail_spout_msg.invoke(executor.clj:439)
    at org.apache.storm.daemon.executor$fn$reify__7993.expire(executor.clj:512)
    at org.apache.storm.utils.RotatingMap.rotate(RotatingMap.java:77)
    at org.apache.storm.daemon.executor$fn__7990$tuple_action_fn__7996.invoke(executor.clj:517)
    at org.apache.storm.daemon.executor$mk_task_receiver$fn__7979.invoke(executor.clj:467)
    at org.apache.storm.disruptor$clojure_handler$reify__7492.onEvent(disruptor.clj:40)
    at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:451)
    at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:430)
    at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:420)
    at org.apache.storm.disruptor$consume_batch.invoke(disruptor.clj:69)
    at org.apache.storm.daemon.executor$fn__7990$fn__8005$fn__8036.invoke(executor.clj:628)
    at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:484)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
```
In the worker.log I see lots of ERRORs like the following (all within a span
of just 30 minutes):
java.nio.channels.ClosedChannelException (24957 times)
java.net.ConnectException (107 times)
java.io.IOException (22 times)
How do I debug this?
Thanks
SG