You may already know this, but please also note that the count of metrics tuples is linear with the overall task count. Higher parallelism puts more pressure on the metrics bolt.
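The quickest relief, as Bobby describes in the quoted reply below, is simply to register the metrics consumer with a higher parallelism hint. A minimal sketch of that call under the 0.9.x package names (the hint of 3 comes straight from the thread; LoggingMetricsConsumer stands in for Erik's NoOpMetricsConsumer so the snippet compiles on its own):

```java
import backtype.storm.Config;                          // org.apache.storm.Config on Storm 1.x
import backtype.storm.metric.LoggingMetricsConsumer;   // substitute your own IMetricsConsumer here

public class MetricsConsumerParallelismExample {
    public static Config buildConf() {
        Config conf = new Config();
        // The default is a single consumer instance; a larger parallelism hint
        // spreads the incoming metrics tuples across several executors instead
        // of funneling them all into one worker.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 3);
        return conf;
    }
}
```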
I believe Taylor and Alessandro have been working on metrics v2. Until metrics v2 is finished, we can only reduce the load with the metrics whitelist / blacklist and the asynchronous metrics consumer bolt coming in Storm 1.1.0 (see the config sketch after the quoted thread below). Before that, you might want to try migrating to 1.x, say 1.0.2 for now.

- Jungtaek Lim (HeartSaVioR)

On Thu, Jan 5, 2017 at 12:42 AM, Bobby Evans <[email protected]> wrote:

> Yes, you are right, that will not help. The best you can do now is to
> increase the number of MetricsConsumer instances that you have. You can do
> this when you register the metrics consumer.
>
> conf.registerMetricsConsumer(NoOpMetricsConsumer.class, 3);
>
> The default is 1, but we have seen with very large topologies, or ones that
> output a lot of metrics, that they can sometimes get bogged down.
>
> You could also try profiling that worker to see what is taking so long.
> If a NoOp is also showing the same signs it would be interesting to see
> why. It could be the number of events coming in, or it could be the size
> of the metrics being sent making deserialization costly.
>
> - Bobby
>
> On Tuesday, January 3, 2017 2:05 PM, Erik Weathers <[email protected]> wrote:
>
> > Thanks for the response Bobby!
> >
> > I think I might have failed to sufficiently emphasize & explain something
> > in my earlier description of the issue: this is happening *only* in a
> > worker process that is hosting a bolt that implements the *IMetricsConsumer*
> > interface. The other 24 worker processes are working just fine; their
> > netty queues do not grow forever. The same number and type of executors
> > are on every worker process, except that one worker that is hosting the
> > metrics consumer bolt.
> >
> > So the netty queue is growing unbounded because of an influx of metrics.
> > The acking and max spout pending configs wouldn't seem to directly
> > influence the filling of the netty queue with custom metrics.
> >
> > Notably, this "choking" behavior happens even with a "NoOpMetricsConsumer"
> > bolt, which is the same as storm's LoggingMetricsConsumer but with
> > handleDataPoints() doing *nothing*. Interesting, right?
> >
> > - Erik
> >
> > On Tue, Jan 3, 2017 at 7:06 AM, Bobby Evans <[email protected]> wrote:
> >
> > > Storm does not have back pressure by default. Also, because storm supports
> > > loops in a topology, the message queues can grow unbounded. We have put in
> > > a number of fixes in newer versions of storm, also for the messaging side
> > > of things. But the simplest way to avoid this is to have acking enabled
> > > and have max spout pending set to a reasonable number. This will typically
> > > be caused by one of the executors in your worker not being able to keep up
> > > with the load coming in. There is also the possibility that a single
> > > thread cannot keep up with the incoming message load. In the former case
> > > you should be able to see the capacity go very high on some of the
> > > executors. In the latter case you will not see that, and may need to add
> > > more workers to your topology.
> > >
> > > - Bobby
> > >
> > > On Thursday, December 22, 2016 10:01 PM, Erik Weathers <[email protected]> wrote:
> > >
> > > > We're debugging a topology's infinite memory growth for a worker process
> > > > that is running a metrics consumer bolt, and we just noticed that the netty
> > > > Server.java's message_queue
> > > > <https://github.com/apache/storm/blob/v0.9.6/storm-core/src/jvm/backtype/storm/messaging/netty/Server.java#L97>
> > > > is growing forever (at least it goes up to ~5GB before it hits heap limits
> > > > and leads to heavy GCing).
> > > > (We found this by using Eclipse's Memory Analysis Tool on a heap dump
> > > > obtained via jmap.)
> > > >
> > > > We're running storm-0.9.6, and this is happening with a topology that is
> > > > processing 200K+ tuples per second and producing a lot of metrics.
> > > >
> > > > I'm a bit surprised that this queue would grow forever; I assumed there
> > > > would be some sort of limit. I'm pretty naive about how netty's message
> > > > receiving system is tied into the Storm executors at this point, though. I'm
> > > > kind of assuming the behavior could be a result of backpressure / slowness
> > > > from our downstream monitoring system, but there's no visibility provided
> > > > by Storm into what's happening with these messages in the netty queues
> > > > (that I have been able to ferret out, at least!).
> > > >
> > > > Thanks for any input you might be able to provide!
> > > >
> > > > - Erik
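For reference, here is a sketch of what the Storm 1.1.0-style registration with metric filtering mentioned at the top of this message might look like. The `topology.metrics.consumer.register` structure and the whitelist/blacklist keys follow the 1.1.x metrics documentation as I recall it, so verify the exact key names against your release; the class, regexes, and parallelism hint are placeholders for illustration only:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;

public class FilteredMetricsConsumerExample {
    public static Config buildConf() {
        Config conf = new Config();

        Map<String, Object> consumer = new HashMap<>();
        consumer.put("class", "org.apache.storm.metric.LoggingMetricsConsumer");
        consumer.put("parallelism.hint", 3);
        // Regex filters on metric names: only whitelisted data points are passed
        // on to the consumer's handleDataPoints(), reducing the work it has to do.
        consumer.put("whitelist", Arrays.asList("^__execute-latency$", "^__capacity$"));

        conf.put("topology.metrics.consumer.register", Arrays.asList(consumer));
        return conf;
    }
}
```

The same entry can also live in storm.yaml under `topology.metrics.consumer.register` instead of being built in code.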
