[
https://issues.apache.org/jira/browse/STORM-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-144:
-------------------------------
Component/s: storm-core
> Provide visibility into buffers between components
> --------------------------------------------------
>
> Key: STORM-144
> URL: https://issues.apache.org/jira/browse/STORM-144
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/222
> It would be nice to see how many tuples are in the input and output buffers
> of Storm components to understand where things are getting bottled up. 0mq
> doesn't currently provide this visibility so it's not clear how to implement
> this.
> ----------
> nahap: Maybe now that you have internal message buffers in Storm 0.8 you
> could use these as an indicator. It is not perfect, but better than nothing.
> ----------
> dkincaid: Based on my understanding of how messages get moved between bolts I
> think there are two places that they can get "stuck" in a queue. The first is
> in the inbound and outbound worker message queues. Prior to 0.8.0 those were
> unbounded LinkedBlockingQueues. Version 0.8.0 changed to use LMAX Disruptor
> queues, which are bounded. At this point, then, we should focus on the
> Disruptor queues.
> The second place messages can get "stuck" is in the ZeroMQ sockets that are
> used to send messages between machines.
> It seems to me that the first thing to do here would be to provide visibility
> into the size of the Disruptor queues in some manner.
> Next, we should look for a way to provide some visibility into the queuing of
> messages within ZeroMQ. I'm far from an expert on ZeroMQ, but from looking at
> the documentation for the zmq_getsockopt call it looks promising:
> ZMQ_BACKLOG: Retrieve maximum length of the queue of outstanding connections
> The ZMQ_BACKLOG option shall retrieve the maximum length of the queue of
> outstanding peer connections for the specified socket; this only applies to
> connection-oriented transports. For details refer to your operating system
> documentation for the listen function.
> Maybe that won't show the actual number of messages waiting in the queue, but
> it should still be an indication of a backup.
> Since the rest of the stats for workers, bolts, etc. are sent to ZooKeeper,
> does it make sense to send a snapshot count of these queues at the same time?
> Personally I'd like to be able to see the average size over the time period
> as well as the max and min, but then we'd be starting to throw more data into
> ZooKeeper, which Nathan has been trying to prune.
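The min/max/average idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `QueueDepthStats` and `QueueDepthDemo` are hypothetical names, not Storm classes, and a `java.util.concurrent.ArrayBlockingQueue` stands in for a bounded Disruptor queue. How Storm would actually read a Disruptor queue's depth, or ship the numbers to ZooKeeper, is not shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper (not part of Storm): tracks min/max/average queue depth. */
class QueueDepthStats {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long sum = 0;
    private long samples = 0;

    /** Record one observation of the queue's current depth. */
    synchronized void sample(int depth) {
        min = Math.min(min, depth);
        max = Math.max(max, depth);
        sum += depth;
        samples++;
    }

    /** Return {min, max, average} for the window, then reset for the next window. */
    synchronized double[] snapshotAndReset() {
        double[] out = samples == 0
            ? new double[] {0, 0, 0}
            : new double[] {min, max, (double) sum / samples};
        min = Long.MAX_VALUE;
        max = Long.MIN_VALUE;
        sum = 0;
        samples = 0;
        return out;
    }
}

class QueueDepthDemo {
    public static void main(String[] args) throws InterruptedException {
        // Assumption: ArrayBlockingQueue stands in for a bounded Disruptor queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(8);
        QueueDepthStats stats = new QueueDepthStats();

        stats.sample(queue.size());   // depth 0
        queue.put("a");
        queue.put("b");
        stats.sample(queue.size());   // depth 2
        queue.put("c");
        queue.put("d");
        stats.sample(queue.size());   // depth 4
        queue.take();
        stats.sample(queue.size());   // depth 3

        // One snapshot per reporting period keeps the data volume at three numbers.
        double[] s = stats.snapshotAndReset();
        System.out.println("min=" + s[0] + " max=" + s[1] + " avg=" + s[2]);
    }
}
```

Aggregating in the worker and publishing only the three summary values per period would keep the extra ZooKeeper data small, which seems to be the concern raised above.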
> -----------
> sustrik: ZMQ_BACKLOG is the listen function's 'backlog' parameter and has
> nothing to do with queued messages.
> Btw, even without queueing at the ZeroMQ layer there's still queueing going on
> in the lower layers (TCP), which is kind of hard to assess. The only
> reasonable solution, AFAICS, is to hard-limit the buffers (on all layers) and
> consider the maximum number of buffered messages to be the error of the queue
> depth measurement.
> Say, if you are using raw TCP and it's possible to buffer 100 messages in
> TCP's tx and rx buffers, you can measure the number N of outstanding messages
> buffered in the application and report the (N, N+100) interval as the queue
> depth.
> The problem gets more complex when there are many TCP connections involved.
> If there are 1000 connections, the (N, N+100) interval expands to (N,
> N+100,000).
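The interval arithmetic above can be written out as a small sketch. `QueueDepthInterval` and its `bounds` method are hypothetical names used only for illustration, and the per-connection limit of 100 messages is the example figure from the comment, not a measured value.

```java
/** Hypothetical helper: bounds on true queue depth when lower-layer buffers are hard-limited. */
class QueueDepthInterval {
    /**
     * Returns {lower, upper}: the depth measured in the application, and that depth
     * plus the worst-case number of messages hidden in per-connection buffers.
     */
    static long[] bounds(long measuredDepth, long connections, long maxBufferedPerConnection) {
        long hidden = Math.multiplyExact(connections, maxBufferedPerConnection);
        return new long[] {measuredDepth, Math.addExact(measuredDepth, hidden)};
    }

    public static void main(String[] args) {
        long n = 42; // arbitrary example value for the measured in-application depth
        long[] one = bounds(n, 1, 100);
        long[] many = bounds(n, 1000, 100);
        System.out.println("1 connection: (" + one[0] + ", " + one[1] + ")");        // (42, 142)
        System.out.println("1000 connections: (" + many[0] + ", " + many[1] + ")");  // (42, 100042)
    }
}
```

The width of the interval grows linearly with the number of connections, which is why the measurement becomes much less informative on a large cluster.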
> ----------
> mrflip: Fixed by #633?