[ https://issues.apache.org/jira/browse/STORM-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-148:
-------------------------------
    Component/s: storm-core

> Track Nimbus actions in UI
> --------------------------
>
>                 Key: STORM-148
>                 URL: https://issues.apache.org/jira/browse/STORM-148
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/77
> 1. Worker reassignment history
> 2. Task timeout history
> ----------
> danehammer: I feel like the logical next step would be to click on a 
> supervisor from the main page, and get details about that supervisor node's 
> goings-on. Workers running, their uptime, and the history you mention. Could 
> even go one step further in, click on a worker, and see the executors/tasks 
> running on that worker.
> ----------
> cnardi: It would be really nice. Sometimes a worker is not behaving as 
> expected (memory or CPU problems) and it's important to know what is being 
> executed there. The only way so far is to go through all the bolts/spouts and 
> see where each one is being executed.
> ----------
> danehammer: I've started familiarizing myself with what would be required to 
> implement this. It feels like the part I'm thinking about, having the workers 
> for every supervisor known, would require changes to the thrift API. I 
> currently have no way of identifying an individual worker. I can get a 
> supervisor, it can tell me the number of workers it has and how many are 
> used, and executors know their host and port, but it feels like there should 
> be a worker object between these two. A supervisor has a set of workers, and 
> an executor lives on a worker. The worker has an uptime, port, host, id, as 
> well as an understanding of its executors.
> Sound right?
> ----------
> nathanmarz: A worker is identified by its [supervisor id, port]. The uptime 
> for a worker is the same as all its executors.
> It would be useful to have a new Thrift method that gets the list of all 
> workers in the cluster, including information such as:
> - Supervisor id and port
> - Host it's running on
> - Executors running in the worker
> Once you have that method, you can easily implement supervisor pages. I think 
> you should leave uptime out for now as that would require fetching the 
> executor heartbeats, which is a very large number of ZooKeeper calls.
> ----------
> danehammer: I would love to see from the UI if a worker's uptime is abnormal. 
> Today if I hit the storm UI and a supervisor has recently gone down, it 
> stands out immediately - its uptime is way lower than the other supervisors. 
> I would imagine the same sort of "one of these does not belong" would be 
> easily recognizable on a supervisor page.
> > The uptime for a worker is the same as all its executors
> Would looking up one of these executors' heartbeats be a valid test of the 
> worker's uptime? I take it this means the executors' heartbeats are what tell 
> the supervisor the worker is up, and that the worker does not have its own 
> heartbeat.
> ----------
> nathanmarz: Well, all the executor heartbeats are kept in worker heartbeats. 
> Fetching all the worker heartbeats for every topology is just going to be too 
> expensive.
> We can solve the heartbeat problem in the future by having the supervisor 
> keep the uptime stats (from its perspective) in the supervisor heartbeat.
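
The worker listing discussed in the thread (a worker keyed by [supervisor id, port], 
carrying its host and its executors, with uptime left out) could be sketched roughly 
as below. This is only an illustration in Python, not Storm's Thrift API: the names 
ExecutorPlacement, WorkerSummary, list_workers and workers_by_supervisor are all 
hypothetical, and the sketch assumes each executor's supervisor id, host and port 
are already known to the caller.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ExecutorPlacement:
    """Hypothetical record of where one executor runs."""
    executor_id: str
    supervisor_id: str
    host: str
    port: int


@dataclass
class WorkerSummary:
    """Hypothetical per-worker summary; uptime is deliberately left out."""
    supervisor_id: str
    port: int
    host: str
    executors: List[str] = field(default_factory=list)


def list_workers(placements: Iterable[ExecutorPlacement]) -> List[WorkerSummary]:
    """Group executors into workers, keyed by (supervisor id, port)."""
    workers: Dict[Tuple[str, int], WorkerSummary] = {}
    for p in placements:
        key = (p.supervisor_id, p.port)
        if key not in workers:
            workers[key] = WorkerSummary(p.supervisor_id, p.port, p.host)
        workers[key].executors.append(p.executor_id)
    # Sort by key so the listing is stable between calls.
    return [workers[k] for k in sorted(workers)]


def workers_by_supervisor(
    workers: Iterable[WorkerSummary],
) -> Dict[str, List[WorkerSummary]]:
    """Bucket workers per supervisor, as a supervisor page would need."""
    grouped: Dict[str, List[WorkerSummary]] = defaultdict(list)
    for w in workers:
        grouped[w.supervisor_id].append(w)
    return dict(grouped)
```

A supervisor page would then render workers_by_supervisor(list_workers(...))[supervisor_id], 
and a worker page would list that worker's executors.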



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
