Github user d2r commented on the pull request:
https://github.com/apache/storm/pull/753#issuecomment-155489896
> @d2r ,
> Thank you very much for your prompt response. However, I cannot quite
understand what you mean by
>
> If the previous worker's throughput stats had declined sharply before
the worker had died, then weighting the current worker's throughput stats still
would be inaccurate, but in a different way.
>
> I would appreciate it a lot if you could provide a concrete example.
Suppose an executor's throughput starts at a normal level but declines for a
while until the worker dies. (This could happen if there is a leak leading to
JVM heap exhaustion, which leads to frequent major garbage collections and then
to the worker being killed.) When the worker restarts, it starts with empty
metrics state and begins again at normal throughput. What we would see is a
weighted metric that seems to say the executor has had normal throughput for
the entire duration of the window, when in reality the executor's throughput
was lower.
The metrics we have so far reset their counts when the new worker starts, and
they do not extrapolate over the full time window. So adding a throughput
metric that behaves differently would be inconsistent: the throughput numbers
would not correlate with the executed and transferred numbers, for example.
This is the inconsistency I was referring to. Both methods can be
inaccurate, but they are inconsistent when both are used together. I would
rather the UI be self-consistent, for our users' sakes. I would like to avoid
mixing metrics that measure what an executor actually has done in the past
window and metrics that extrapolate what an executor should have done in the
past window.
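To make the scenario concrete, here is a rough sketch with made-up numbers and
hypothetical method names (this is not Storm code) of how the two approaches
would report the same window:

```java
// A rough sketch (illustrative numbers, not Storm code) of why the two
// window behaviors disagree after a worker restart.
public class WindowedThroughputSketch {

    /** What the existing counters imply: only the tuples the current worker
     *  actually processed since it (re)started, spread over the full window. */
    static double resetOnRestart(long tuplesSinceRestart, long windowSecs) {
        return (double) tuplesSinceRestart / windowSecs;
    }

    /** A weighted/extrapolated variant: the rate observed since the restart,
     *  reported as if it had held for the entire window. */
    static double extrapolated(long tuplesSinceRestart, long secsSinceRestart) {
        return (double) tuplesSinceRestart / secsSinceRestart;
    }

    public static void main(String[] args) {
        long windowSecs = 600;           // 10-minute window
        long secsSinceRestart = 60;      // worker died and restarted 1 minute ago
        long tuplesSinceRestart = 6_000; // healthy 100 tuples/s after the restart

        // The old worker had been degrading (leak -> GC -> kill) and managed
        // only 3,000 tuples in the 9 minutes before it died.
        long tuplesBeforeDeath = 3_000;

        double reset  = resetOnRestart(tuplesSinceRestart, windowSecs);     // 10/s
        double extrap = extrapolated(tuplesSinceRestart, secsSinceRestart); // 100/s
        double truth  =
            (double) (tuplesBeforeDeath + tuplesSinceRestart) / windowSecs; // 15/s

        System.out.printf("reset-on-restart: %5.1f tuples/s%n", reset);
        System.out.printf("extrapolated:     %5.1f tuples/s%n", extrap);
        System.out.printf("true average:     %5.1f tuples/s%n", truth);
        // The extrapolated view reports a normal rate for the whole window even
        // though the executor's real average was much lower; the reset view is
        // also off, but it stays consistent with the executed/transferred counts.
    }
}
```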
> I couldn't agree with you more that Storm needs a History Server to keep
historical information. Otherwise, executors are responsible for maintaining
their own stats, which makes them stateful. Are there any plans for a history
server?
There was some talk a while ago about integrating Hadoop's Timeline Server,
which is used with Tez, but looking in JIRA, I do not see an issue created for
it. This is something that would definitely help users.
> By the way, adding a throughput metric is my first step. My ultimate
goal is to add normalized throughput, which leverages queueing theory to
provide a comparable performance metric, similar to but more accurate than the
capacity metric currently available in Storm. With normalized throughput, one
can easily identify the performance bottleneck of a running topology by finding
the executor with the minimal normalized throughput. With this capability, we
can develop a runtime scheduling algorithm that makes better resource
allocations. So what do you think?
Feedback to the scheduler is a good idea, and it would open up interesting
possibilities. The capacity metric tries to show whether an executor is able
to keep up with the rate of input, and if that is what you mean by normalized
throughput, it could then replace capacity. We would just need to make sure to
update the tooltip explaining what the metric means.
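For reference, the capacity number in the UI is essentially the fraction of the
window the executor spent executing tuples, which is also roughly the
queueing-theory utilization rho = lambda / mu. A small illustrative sketch with
hypothetical numbers (not Storm's actual implementation):

```java
// An illustrative sketch of the capacity metric read as a utilization figure.
// Numbers are hypothetical; this is not Storm's actual implementation.
public class CapacitySketch {

    /** Approximate fraction of the window spent executing tuples:
     *  capacity ~= executed * executeLatencyMs / windowMs.
     *  Values near 1.0 mean the executor can barely keep up with its input. */
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        long windowMs = 10 * 60 * 1000L; // 10-minute window

        // Two hypothetical executors over the same window.
        double boltA = capacity(500_000, 0.8, windowMs); // ~0.67, has headroom
        double boltB = capacity(115_000, 5.0, windowMs); // ~0.96, near its limit

        System.out.printf("boltA capacity: %.2f%n", boltA);
        System.out.printf("boltB capacity: %.2f%n", boltB);
        // Read as the utilization rho = lambda / mu, the executor closest to
        // 1.0 (boltB here) is the likely bottleneck, which is the kind of
        // signal a normalized-throughput metric could sharpen.
    }
}
```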