James Xu created STORM-148:
------------------------------
Summary: Track Nimbus actions in UI
Key: STORM-148
URL: https://issues.apache.org/jira/browse/STORM-148
Project: Apache Storm (Incubating)
Issue Type: New Feature
Reporter: James Xu
Priority: Minor
https://github.com/nathanmarz/storm/issues/77
1. Worker reassignment history
2. Task timeout history
----------
danehammer: I feel like the logical next step would be to click on a supervisor
from the main page and get details about that supervisor node's goings-on.
Workers running, their uptime, and the history you mention. Could even go one
step further in, click on a worker, and see the executors/tasks running on that
worker.
----------
cnardi: It would be really nice. Sometimes a worker is not behaving as expected
(memory or CPU problems) and it's important to know what is being executed
there. The only way so far is to go through all the bolts/spouts and see where
each is being executed.
----------
danehammer: I've started familiarizing myself with what would be required to
implement this. It feels like the part I'm thinking about, making the workers
for every supervisor known to the UI, would require changes to the Thrift API. I
currently have no way of identifying an individual worker. I can get a
supervisor, it can tell me the number of workers it has and how many are used,
and executors know their host and port, but it feels like there should be a
worker object between these two. A supervisor has a set of workers, and an
executor lives on a worker. The worker has an uptime, port, host, id, as well
as an understanding of its executors.
Sound right?
----------
nathanmarz: A worker is identified by its [supervisor id, port]. The uptime for
a worker is the same as that of all its executors.
It would be useful to have a new Thrift method that gets the list of all
workers in the cluster, including information such as:
Supervisor id and port
Host it's running on
Executors running in the worker
Once you have that method, you can easily implement supervisor pages. I think
you should leave uptime out for now, as that would require fetching the executor
heartbeats, which means a very large number of Zookeeper calls.
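The [supervisor id, port] identity described above could be sketched roughly as below. This is only an illustration, not Storm's actual Thrift-generated types: `ExecutorInfo`, `WorkerSummary`, and `summarize` are made-up names. The point is that a flat list of executors (each knowing its supervisor, host, and port) can be grouped into per-worker summaries keyed by [supervisor id, port], which is what a new "list all workers" Thrift method would return.

```java
import java.util.*;

// Hypothetical sketch: these are illustrative names, not Storm's actual
// Thrift definitions.
public class WorkerGrouping {
    // An executor as the discussion describes it: it knows its supervisor,
    // host, and port.
    public record ExecutorInfo(String supervisorId, String host, int port,
                               int executorId) {}

    // A worker is identified by [supervisor id, port] and aggregates the
    // executors running in it.
    public record WorkerSummary(String supervisorId, String host, int port,
                                List<Integer> executorIds) {}

    // Group a flat list of executors into per-worker summaries, keyed by
    // [supervisor id, port].
    public static List<WorkerSummary> summarize(List<ExecutorInfo> executors) {
        Map<List<Object>, WorkerSummary> byWorker = new LinkedHashMap<>();
        for (ExecutorInfo e : executors) {
            List<Object> key = List.of(e.supervisorId(), e.port());
            byWorker.computeIfAbsent(key, k ->
                new WorkerSummary(e.supervisorId(), e.host(), e.port(),
                                  new ArrayList<>()))
                    .executorIds().add(e.executorId());
        }
        return new ArrayList<>(byWorker.values());
    }

    public static void main(String[] args) {
        List<ExecutorInfo> execs = List.of(
            new ExecutorInfo("sup-1", "node-a", 6700, 1),
            new ExecutorInfo("sup-1", "node-a", 6700, 2),
            new ExecutorInfo("sup-1", "node-a", 6701, 3));
        // Two distinct [supervisor id, port] pairs -> two workers.
        List<WorkerSummary> workers = summarize(execs);
        System.out.println(workers.size());               // 2
        System.out.println(workers.get(0).executorIds()); // [1, 2]
    }
}
```

A supervisor page in the UI would then just filter these summaries by supervisor id.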
----------
danehammer: I would love to see from the UI if a worker's uptime is abnormal.
Today if I hit the Storm UI and a supervisor has recently gone down, it stands
out immediately: its uptime is way lower than the other supervisors'. I would
imagine the same sort of "one of these does not belong" would be easily
recognizable on a supervisor page.
> The uptime for a worker is the same as all its executors
Would looking up one of these executor's heartbeats be a valid test of the
worker's uptime? I take it this means the executor's heartbeats are what tell
the supervisor the worker is up, and that the worker does not have its own
heartbeat.
----------
nathanmarz: Well, all the executor heartbeats are kept in worker heartbeats.
Fetching all the worker heartbeats for every topology is just going to be too
expensive.
We can solve the heartbeat problem in the future by having the supervisor keep
the uptime stats (from its perspective) in the supervisor heartbeat.
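The workaround suggested here could look roughly like the sketch below (made-up names, not Storm's actual heartbeat structures): the supervisor records each worker's start time from its own perspective and publishes that in its single heartbeat, so the UI can derive per-worker uptimes from one znode per supervisor instead of fetching the far more numerous executor heartbeats.

```java
import java.util.Map;

// Hypothetical sketch of keeping per-worker uptime stats in the supervisor
// heartbeat. Class and field names are illustrative only.
public class SupervisorHeartbeat {
    // Port of each worker slot -> start time in epoch seconds, as observed
    // by the supervisor itself.
    private final Map<Integer, Long> workerStartSecs;

    public SupervisorHeartbeat(Map<Integer, Long> workerStartSecs) {
        this.workerStartSecs = workerStartSecs;
    }

    // Uptime of the worker on a given port, as of `nowSecs`.
    public long uptimeSecs(int port, long nowSecs) {
        Long start = workerStartSecs.get(port);
        if (start == null) {
            throw new IllegalArgumentException("no worker on port " + port);
        }
        return nowSecs - start;
    }

    public static void main(String[] args) {
        SupervisorHeartbeat hb =
            new SupervisorHeartbeat(Map.of(6700, 1000L, 6701, 1500L));
        System.out.println(hb.uptimeSecs(6700, 1600L)); // 600
    }
}
```

The design choice is the one stated above: trade a small amount of extra data in an already-existing heartbeat for avoiding a fan-out of Zookeeper reads per executor.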
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)