GitHub user d2r commented on the pull request:

    https://github.com/apache/storm/pull/753#issuecomment-155489896
  
    > @d2r,
    > Thank you very much for your prompt response. However, I cannot quite understand what you mean by
    > 
    >     If the previous worker's throughput stats had declined sharply before the worker had died, then weighting the current worker's throughput stats still would be inaccurate, but in a different way.
    > 
    > I would appreciate it a lot if you could provide a concrete example.
    
    Suppose an executor's throughput starts at a normal level but declines for a while until the worker dies.  (This could happen if a leak leads to JVM heap exhaustion, causing frequent major garbage collections and eventually the worker being killed.)  When the worker restarts, it will restart with an empty metrics state and will begin again at normal throughput.  What we would see is a weighted metric that seems to say the executor had normal throughput for the entire duration of the window, when in reality the executor's throughput was lower.
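
    To make the arithmetic concrete, here is a minimal sketch of the scenario above.  The numbers and the class name are made up for illustration; this is not Storm's stats code.

        // Hypothetical numbers illustrating the scenario above; this is
        // not Storm's stats code.
        public class WeightedThroughputExample {
            public static void main(String[] args) {
                long windowSecs = 600;           // 10-minute metrics window
                long uptimeSecs = 60;            // new worker has run for 1 minute
                long tuplesSinceRestart = 6_000; // back at ~100 tuples/sec

                // What the executor actually did across the whole window:
                // declining throughput, then dead, then 1 minute back at the
                // normal rate -- say 12,000 tuples in total.
                long actualWindowTuples = 12_000;

                // Weighting/extrapolating the post-restart rate over the full
                // window reports "normal" throughput for all of it.
                double extrapolated = (double) tuplesSinceRestart / uptimeSecs;

                // A reset-style count (like executed/transferred) reflects
                // only what happened, so the window average is much lower.
                double actualAverage = (double) actualWindowTuples / windowSecs;

                System.out.printf("extrapolated: %.0f tuples/s%n", extrapolated); // 100
                System.out.printf("actual avg:   %.0f tuples/s%n", actualAverage); //  20
            }
        }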
    
    The other metrics we use reset their counts when the new worker is started, and they do not extrapolate over the full time window.  So adding a throughput metric that does not behave this way would be inconsistent.  The throughput numbers would not correlate with the executed and transferred counts, for example.
    
    This is the inconsistency I was referring to.  Both methods can be 
inaccurate, but they are inconsistent when both are used together.  I would 
rather the UI be self-consistent, for our users' sakes.  I would like to avoid 
mixing metrics that measure what an executor actually has done in the past 
window and metrics that extrapolate what an executor should have done in the 
past window.
    
    
    
    > I couldn't agree with you more that Storm needs a History Server to keep historical information. Otherwise, executors are responsible for maintaining their own stats, which makes them stateful. Is there any plan for the history server?
    
    There was some talk a while ago about integrating Hadoop's Timeline Server, which is used with Tez, but looking in JIRA, I do not see an issue created for it.  This is something that would definitely help users.
    
    
    > By the way, adding a throughput metric is my first step. My ultimate goal is to add normalized throughput, which leverages queueing theory to provide a comparable performance metric, similar to but more accurate than the capacity metric currently available in Storm. With normalized throughput, one can easily identify the performance bottleneck of a running topology by finding the executor with the lowest normalized throughput. With this capability, we can develop a runtime scheduling algorithm that makes better resource allocation decisions. So what do you think?
    
    Feedback to the scheduler is a good idea, and it would open up interesting 
possibilities.  The capacity metric tries to show whether an executor is able 
to keep up with the rate of input, and if that is what you mean by normalized 
throughput, it could then replace capacity.  We would just need to make sure to 
update the tooltip explaining what the metric means.
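
    For reference, here is a rough sketch of the idea behind capacity: approximately the fraction of the window an executor spent executing tuples.  The numbers and class name below are made up; the real computation lives in Storm's stats code.

        // Rough sketch of the idea behind the capacity metric; made-up
        // numbers, not Storm's actual stats code.
        public class CapacitySketch {
            // Approximate fraction of the window spent executing tuples.
            static double capacity(long executed, double execLatencyMs, long windowMs) {
                return (executed * execLatencyMs) / windowMs;
            }

            public static void main(String[] args) {
                long windowMs = 10 * 60 * 1000; // 10-minute window
                long executed = 120_000;        // tuples executed in the window
                double execLatencyMs = 4.5;     // average execute latency

                // ~0.90 here means the executor was busy ~90% of the time,
                // i.e. close to not keeping up with its input rate.
                System.out.printf("capacity = %.2f%n",
                        capacity(executed, execLatencyMs, windowMs));
            }
        }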


