[ 
https://issues.apache.org/jira/browse/STORM-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kellogg updated STORM-144:
-------------------------------
    Component/s: storm-core

> Provide visibility into buffers between components
> -------------------------------------------------
>
>                 Key: STORM-144
>                 URL: https://issues.apache.org/jira/browse/STORM-144
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/222
> It would be nice to see how many tuples are in the input and output buffers 
> of Storm components to understand where things are getting bottled up. 0mq 
> doesn't currently provide this visibility so it's not clear how to implement 
> this.
> ----------
> nahap: maybe now that you have internal message buffers in storm 0.8 you 
> could use these as an indicator. it is not perfect but better than nothing
> ----------
> dkincaid: Based on my understanding of how messages get moved between bolts I 
> think there are two places that they can get "stuck" in a queue. The first is 
> in the inbound and outbound worker message queues. Prior to 0.8.0 those were 
> unbounded LinkedBlockingQueues. Version 0.8.0 switched to LMAX Disruptor 
> queues, which are bounded, so at this point we should focus on the Disruptor 
> queues.
> The second place messages can get "stuck" is in the ZeroMQ sockets that are 
> used to send messages between machines.
> It seems to me that the first thing to do here would be to provide visibility 
> into the size of the Disruptor queues in some manner.
> Next, we should look for a way to provide some visibility into the queuing of 
> messages within ZeroMQ. I'm far from an expert on ZeroMQ, but from looking at 
> the documentation for the zmq_getsockopt call it looks promising:
> ZMQ_BACKLOG: Retrieve maximum length of the queue of outstanding connections
> The ZMQ_BACKLOG option shall retrieve the maximum length of the queue of 
> outstanding peer connections for the specified socket; this only applies to 
> connection-oriented transports. For details refer to your operating system 
> documentation for the listen function.
> Maybe that won't show the actual number of messages waiting in the queue, but 
> should still be an indication of a backup.
> Since the rest of the stats for workers, bolts, etc. are sent to ZooKeeper, 
> does it make sense to send a snapshot count of these queues at the same time? 
> Personally I'd like to be able to see the average size over the time period 
> as well as the max and min, but then we'd be starting to throw more data into 
> ZooKeeper, which Nathan has been trying to prune.
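
A minimal sketch of the min/max/average idea above, in Java. The class and
method names (QueueDepthStats, sample, reset) are hypothetical and are not
part of Storm or the Disruptor API, and a java.util.concurrent
ArrayBlockingQueue stands in for the bounded Disruptor queue purely for
illustration. The point is only that a small summary per reporting period
could be sent instead of every raw sample.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper: accumulates min/max/average of sampled queue depths. */
final class QueueDepthStats {
    private long count;
    private long sum;
    private int min = Integer.MAX_VALUE;
    private int max = Integer.MIN_VALUE;

    /** Record one observation of the current queue depth. */
    synchronized void sample(int depth) {
        count++;
        sum += depth;
        min = Math.min(min, depth);
        max = Math.max(max, depth);
    }

    synchronized long count() { return count; }
    synchronized int min() { return count == 0 ? 0 : min; }
    synchronized int max() { return count == 0 ? 0 : max; }
    synchronized double average() { return count == 0 ? 0.0 : (double) sum / count; }

    /** Start a new reporting period, e.g. after the summary has been sent. */
    synchronized void reset() {
        count = 0;
        sum = 0;
        min = Integer.MAX_VALUE;
        max = Integer.MIN_VALUE;
    }
}

final class QueueDepthDemo {
    public static void main(String[] args) {
        // Assumption: a bounded JDK queue stands in for a Disruptor queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(8);
        QueueDepthStats stats = new QueueDepthStats();

        // Producer side: enqueue five tuples, sampling the depth after each.
        for (int i = 0; i < 5; i++) {
            queue.offer("tuple-" + i);
            stats.sample(queue.size());
        }
        // Consumer side: drain two tuples, then sample once more.
        queue.poll();
        queue.poll();
        stats.sample(queue.size());

        System.out.println("min=" + stats.min()
                + " max=" + stats.max()
                + " avg=" + stats.average());
        stats.reset();
    }
}
```

Only three numbers per queue per period would need to be published this way,
which keeps the extra load on ZooKeeper small, though how often to sample is
left open here.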
> -----------
> sustrik: ZMQ_BACKLOG is the listen function's 'backlog' parameter and has 
> nothing to do with queued messages.
> Btw, even without queueing on the ZeroMQ layer there's still queueing going 
> on in the lower layers (TCP), which is kind of hard to assess. The only 
> reasonable solution, AFAICS, is to hard-limit the buffers (on all layers) and 
> consider the maximum number of buffered messages to be the error of the queue 
> depth measurement.
> Say, if you are using raw TCP and it's possible to buffer 100 messages in 
> TCP's tx and rx buffers, you can measure the number of outstanding messages 
> buffered in the application and report the (N, N+100) interval as the queue 
> depth.
> The problem gets more complex when there are many TCP connections involved. 
> If there are 1000 connections, the (N, N+100) interval expands to (N, 
> N+100,000).
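
The interval arithmetic above can be sketched in Java as follows. The class
QueueDepthInterval and its estimate method are hypothetical names used for
illustration only. The sketch assumes every layer below the application is
hard-limited to a known number of messages per connection, as suggested
above.

```java
/** Hypothetical value type: the queue depth is somewhere in [lower, upper]. */
final class QueueDepthInterval {
    final long lower;
    final long upper;

    QueueDepthInterval(long lower, long upper) {
        this.lower = lower;
        this.upper = upper;
    }

    /**
     * @param appBuffered      messages counted in application-level buffers (N)
     * @param maxPerConnection hard limit on messages buffered below the
     *                         application on one connection (e.g. 100)
     * @param connections      number of TCP connections involved
     */
    static QueueDepthInterval estimate(long appBuffered,
                                       long maxPerConnection,
                                       long connections) {
        if (appBuffered < 0 || maxPerConnection < 0 || connections < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // Worst case: every connection's lower-layer buffers are full.
        long slack = Math.multiplyExact(maxPerConnection, connections);
        return new QueueDepthInterval(appBuffered, Math.addExact(appBuffered, slack));
    }

    @Override
    public String toString() {
        return "(" + lower + ", " + upper + ")";
    }
}
```

With N = 5, QueueDepthInterval.estimate(5, 100, 1) gives (5, 105) and
QueueDepthInterval.estimate(5, 100, 1000) gives (5, 100005), matching the
(N, N+100) and (N, N+100,000) cases.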
> ----------
> mrflip: Fixed by #633 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
