[
https://issues.apache.org/jira/browse/STORM-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-148:
-------------------------------
Component/s: storm-core
> Track Nimbus actions in UI
> --------------------------
>
> Key: STORM-148
> URL: https://issues.apache.org/jira/browse/STORM-148
> Project: Apache Storm
> Issue Type: New Feature
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/77
> 1. Worker reassignment history
> 2. Task timeout history
> ----------
> danehammer: I feel like the logical next step would be to click on a
> supervisor from the main page and get details about that supervisor node's
> goings-on: workers running, their uptime, and the history you mention. Could
> even go one step further in, click on a worker, and see the executors/tasks
> running on that worker.
> ----------
> cnardi: It would be really nice. Sometimes a worker is not behaving as
> expected (memory or CPU problems), and it's important to know what is being
> executed there. The only way so far is to go through all the bolts/spouts and
> see where they are being executed.
> ----------
> danehammer: I've started familiarizing myself with what would be required to
> implement this. It feels like the part I'm thinking about, having the workers
> for every supervisor known, would require changes to the Thrift API. I
> currently have no way of identifying an individual worker. I can get a
> supervisor, and it can tell me the number of workers it has and how many are
> in use, and executors know their host and port, but it feels like there should
> be a worker object between these two: a supervisor has a set of workers, and
> an executor lives on a worker. The worker has an uptime, port, host, and id,
> as well as an understanding of its executors.
> Sound right?
> ----------
> nathanmarz: A worker is identified by its [supervisor id, port]. The uptime
> for a worker is the same as all its executors.
> It would be useful to have a new Thrift method that gets the list of all
> workers in the cluster, including information such as:
> - Supervisor id and port
> - Host it's running on
> - Executors running in the worker
> Once you have that method, you can easily implement supervisor pages. I think
> you should leave uptime out for now as that would require fetching the
> executor heartbeats, which is a very large amount of Zookeeper calls.
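The worker identity and summary fields described above can be sketched as a plain Java class. This is only an illustration of the proposed shape, not actual Storm code; the class and field names (WorkerSummary, supervisorId, executorIds) are assumptions for the sake of the example. Note that equality follows the [supervisor id, port] pair, as nathanmarz describes.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the proposed worker summary; names are assumed,
// not taken from the Storm codebase.
public class WorkerSummary {
    public final String supervisorId;      // with port, uniquely identifies the worker
    public final int port;
    public final String host;              // host the worker is running on
    public final List<String> executorIds; // executors running in this worker

    public WorkerSummary(String supervisorId, int port, String host,
                         List<String> executorIds) {
        this.supervisorId = supervisorId;
        this.port = port;
        this.host = host;
        this.executorIds = executorIds;
    }

    // Identity is [supervisor id, port]; host and executors are descriptive.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WorkerSummary)) return false;
        WorkerSummary w = (WorkerSummary) o;
        return supervisorId.equals(w.supervisorId) && port == w.port;
    }

    @Override
    public int hashCode() {
        return Objects.hash(supervisorId, port);
    }
}
```

In a real implementation this would be a Thrift struct returned by the new cluster-wide method, and uptime is deliberately left out, per the note about the cost of fetching executor heartbeats.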
> ----------
> danehammer: I would love to see from the UI if a worker's uptime is abnormal.
> Today if I hit the Storm UI and a supervisor has recently gone down, it
> stands out immediately - its uptime is way lower than the other supervisors'.
> I would imagine the same sort of "one of these does not belong" would be
> easily recognizable on a supervisor page.
> "The uptime for a worker is the same as all its executors"
> Would looking up one of these executors' heartbeats be a valid test of the
> worker's uptime? I take it this means the executors' heartbeats are what tell
> the supervisor the worker is up, and that the worker does not have its own
> heartbeat.
> ----------
> nathanmarz: Well, all the executor heartbeats are kept in worker heartbeats.
> Fetching all the worker heartbeats for every topology is just going to be too
> expensive.
> We can solve the heartbeat problem in the future by having the supervisor
> keep the uptime stats (from its perspective) in the supervisor heartbeat.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)