James Xu created STORM-148:
------------------------------

             Summary: Track Nimbus actions in UI
                 Key: STORM-148
                 URL: https://issues.apache.org/jira/browse/STORM-148
             Project: Apache Storm (Incubating)
          Issue Type: New Feature
            Reporter: James Xu
            Priority: Minor


https://github.com/nathanmarz/storm/issues/77

1. Worker reassignment history
2. Task timeout history

----------
danehammer: I feel like the logical next step would be to click on a supervisor 
from the main page and get details about that supervisor node's goings-on: 
workers running, their uptime, and the history you mention. We could even go one 
step further and click on a worker to see the executors/tasks running on that 
worker.

----------
cnardi: It would be really nice. Sometimes a worker is not behaving as expected 
(memory or CPU problems), and it's important to know what is being executed 
there. So far the only way to find out is to go through all the bolts/spouts and 
see where each is being executed.

----------
danehammer: I've started familiarizing myself with what would be required to 
implement this. It feels like the part I'm thinking about (making the workers 
for every supervisor known) would require changes to the Thrift API. I 
currently have no way of identifying an individual worker. I can get a 
supervisor, which can tell me how many workers it has and how many are in use, 
and executors know their host and port, but it feels like there should be a 
worker object between these two. A supervisor has a set of workers, and an 
executor lives on a worker. The worker has an uptime, port, host, and id, as 
well as an understanding of its executors.

Sound right?
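As a sketch, the hierarchy being proposed might look like this (all class and 
field names here are hypothetical, not existing Storm Thrift structures):

```java
import java.util.List;

// Hypothetical data model for the proposal above; Storm's Thrift API
// has no worker-level object like this today.
class ExecutorInfo {
    String host;   // executors already know their host and port
    int port;
}

class WorkerInfo {
    String id;     // e.g. derived from [supervisor id, port]
    String host;
    int port;
    int uptimeSecs;
    List<ExecutorInfo> executors;  // a worker understands its executors
}

class SupervisorInfo {
    String supervisorId;
    List<WorkerInfo> workers;      // a supervisor has a set of workers
}
```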

----------
nathanmarz: A worker is identified by its [supervisor id, port]. The uptime for 
a worker is the same as all its executors.

It would be useful to have a new Thrift method that gets the list of all 
workers in the cluster, including information such as:

- Supervisor id and port
- Host it's running on
- Executors running in the worker

Once you have that method, you can easily implement supervisor pages. I think 
you should leave uptime out for now, as that would require fetching the executor 
heartbeats, which means a very large number of Zookeeper calls.
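A rough shape for that suggested Thrift method, rendered as plain Java for 
illustration (names are hypothetical; real generated Thrift code would differ):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical summary object for a cluster-wide worker listing.
class WorkerSummary {
    String supervisorId;       // [supervisor id, port] identifies the worker
    int port;
    String host;
    List<String> executorIds;  // executors running in this worker
    // No uptime field: that would require fetching executor heartbeats,
    // which means too many Zookeeper calls.

    WorkerSummary(String supervisorId, int port, String host,
                  List<String> executorIds) {
        this.supervisorId = supervisorId;
        this.port = port;
        this.host = host;
        this.executorIds = executorIds;
    }
}

// The new method would return one WorkerSummary per worker in the cluster.
interface ClusterWorkerLister {
    List<WorkerSummary> getWorkerSummaries();
}
```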

----------
danehammer: I would love to see from the UI if a worker's uptime is abnormal. 
Today, if I hit the Storm UI and a supervisor has recently gone down, it stands 
out immediately: its uptime is way lower than the other supervisors'. I would 
imagine the same sort of "one of these does not belong" would be easily 
recognizable on a supervisor page.

> The uptime for a worker is the same as all its executors

Would looking up one of these executors' heartbeats be a valid test of the 
worker's uptime? I take it this means the executor heartbeats are what tell 
the supervisor the worker is up, and that the worker does not have its own 
heartbeat.

----------
nathanmarz: Well, all the executor heartbeats are kept in worker heartbeats. 
Fetching all the worker heartbeats for every topology is just going to be too 
expensive.

We can solve the heartbeat problem in the future by having the supervisor keep 
the uptime stats (from its perspective) in the supervisor heartbeat.
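That future fix could look something like the sketch below: the supervisor 
records each worker's uptime locally and ships it in its own heartbeat, so the 
UI never has to read executor heartbeats from Zookeeper. All names here are 
invented for illustration, not Storm's actual heartbeat structure:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical supervisor heartbeat carrying per-worker uptimes as
// observed by the supervisor itself.
class SupervisorHeartbeat {
    String supervisorId;
    int uptimeSecs;  // the supervisor's own uptime
    Map<Integer, Integer> workerUptimeSecs = new HashMap<>();  // port -> uptime

    void recordWorkerUptime(int port, int uptime) {
        workerUptimeSecs.put(port, uptime);
    }
}
```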

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
