[ 
https://issues.apache.org/jira/browse/STORM-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651913#comment-14651913
 ] 

Shyam Rajendran commented on STORM-919:
---------------------------------------

- I have added a few more JVM-related metrics, such as GC time, GC count, 
ThreadCPUTimes and ThreadAllocatedBytes stats, in addition to the node CPU stats.

It would also be helpful to get opinions on the following approach for arriving 
at a relative rank of the topology components. This is meant to help implement 
a feedback loop into the resource aware scheduler in the future.

As a POC, we took one specific metric and tried a heuristic approach that takes 
into account the node and worker slot information per component to rank the 
components. A REST endpoint is exposed that collates the details per worker 
slot (node + port) for all the tasks executing in it, like so:

6897b645-667f-40ca-9cfc-5fa0d749e9cc 6704 1 0 1 1 0 0.14691151919866444 1.0
6897b645-667f-40ca-9cfc-5fa0d749e9cc 6701 1 1 1 0 1 0.2525083612040134 1.0
6897b645-667f-40ca-9cfc-5fa0d749e9cc 6703 1 1 0 1 0 0.16279069767441862 1.0
08119658-a284-42fa-81a4-08358a08a91a 6703 1 1 0 1 0 0.1778523489932886 1.0
08119658-a284-42fa-81a4-08358a08a91a 6702 1 1 1 0 0 0.18090452261306533 1.0
08119658-a284-42fa-81a4-08358a08a91a 6704 1 0 1 1 0 0.13590604026845637 1.0
6897b645-667f-40ca-9cfc-5fa0d749e9cc 6700 1 1 1 0 1 0.2441860465116279 1.0
6897b645-667f-40ca-9cfc-5fa0d749e9cc 6702 1 1 1 0 0 0.19198664440734559 1.0
08119658-a284-42fa-81a4-08358a08a91a 6700 1 1 1 0 1 0.25585284280936454 1.0
08119658-a284-42fa-81a4-08358a08a91a 6701 1 1 1 0 1 0.24749163879598662 1.0

where each row is a unique [ node + port ] followed by the count of tasks per 
component executing on it. In this case we could see up to four tasks executing 
per slot. For example, the first row has one task each of component 2 and 
component 3. Each row then ends with the JVM process utilization and the node 
CPU utilization. 
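
To make the row layout concrete, here is a minimal Python sketch that splits 
one such line into its fields. The names SlotSample and parse_slot_line are 
hypothetical and only for illustration; the field order (node id, port, 
per-component task counts, JVM utilization, node CPU utilization) is assumed 
from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlotSample:
    """One worker slot row (hypothetical type, for illustration only)."""
    node_id: str
    port: int
    task_counts: List[int]  # one count per topology component
    jvm_util: float         # JVM process utilization
    node_util: float        # node CPU utilization


def parse_slot_line(line: str) -> SlotSample:
    """Parse 'node port c1 ... cN jvm node' into a SlotSample."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected node, port, >=1 task count, jvm, node util")
    return SlotSample(
        node_id=fields[0],
        port=int(fields[1]),
        # Everything between the port and the last two fields is a task count.
        task_counts=[int(f) for f in fields[2:-2]],
        jvm_util=float(fields[-2]),
        node_util=float(fields[-1]),
    )


sample = parse_slot_line(
    "6897b645-667f-40ca-9cfc-5fa0d749e9cc 6704 1 0 1 1 0 0.14691151919866444 1.0"
)
```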

The above information can be viewed as a system of M linear equations in N 
unknowns, the unknowns here being the components. We used the Apache Commons 
Math library to solve it through a technique called singular value 
decomposition, which helps when the system is overdetermined or underdetermined.
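
As a rough illustration of the idea (not the Apache Commons Math code used in 
the fork), the sketch below solves the ten rows above with NumPy's SVD-based 
least-squares routine. It assumes the right-hand side is the JVM utilization 
column and that each unknown is the per-task contribution of one component. 
The ten rows contain only four distinct task-count patterns for five unknowns, 
so the system is rank-deficient and the minimum-norm solution returned here is 
only one of many; the resulting ranking is illustrative.

```python
import numpy as np

# Task counts per component (one column per component), copied from the
# ten slot rows listed above.
A = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
], dtype=float)

# Assumption: the right-hand side is the JVM process utilization column.
b = np.array([
    0.14691151919866444,
    0.2525083612040134,
    0.16279069767441862,
    0.1778523489932886,
    0.18090452261306533,
    0.13590604026845637,
    0.2441860465116279,
    0.19198664440734559,
    0.25585284280936454,
    0.24749163879598662,
])

# lstsq is SVD-based and returns the minimum-norm least-squares solution,
# so it copes with overdetermined and rank-deficient systems alike.
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)

# Component indices (0-based columns of A), largest estimated load first.
ranking = np.argsort(-x)
print("estimated per-task load:", x)
print("ranking:", ranking)
```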

We tried to understand how the following scenarios impact the utilization:
1 - Tasks that load the JVM process. 
2 - Tasks that have multilang support and potentially load the node.
3 - Background threads spawned [ for example, database connections or similar 
resources that could be shared among the executors. Help on understanding this 
would be appreciated ] 

We tried different topologies to simulate the above scenarios and solved the 
equations using the formula below, which adds a correction value to the JVM% 
stat as a function of the number of tasks and the node's metrics.

 JVM% + (Sum of tasks in that JVM) * ( (SuNode% - (sum of JVM% of workers in 
that SuNode)) / (Task count on the SuNode) )
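
Read literally, the formula spreads the node utilization that is not accounted 
for by any worker JVM evenly over all tasks on the node, then credits each JVM 
with its share. A minimal Python sketch of that reading follows; the function 
and parameter names are hypothetical, and the numbers in the example are made 
up for the arithmetic, not measurements.

```python
def corrected_jvm_utilization(
    jvm_util: float,
    tasks_in_jvm: int,
    node_util: float,
    sum_jvm_util_on_node: float,
    tasks_on_node: int,
) -> float:
    """JVM% plus this JVM's share of the node load no worker JVM accounts for."""
    if tasks_on_node <= 0:
        raise ValueError("tasks_on_node must be positive")
    # Node utilization not explained by the worker JVMs on that node.
    unaccounted = node_util - sum_jvm_util_on_node
    # Spread it evenly per task, then scale by the tasks in this JVM.
    return jvm_util + tasks_in_jvm * (unaccounted / tasks_on_node)


# Made-up numbers: 0.2 + 4 * ((1.0 - 0.8) / 16) = 0.25
example = corrected_jvm_utilization(0.2, 4, 1.0, 0.8, 16)
```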

We have been seeing good results with this approach to ranking when comparing 
expected and observed utilization values. We will be running more tests in the 
coming days to tweak the formula and measure its accuracy. I haven't updated 
the pull request with this change yet, but I have been working on an internal 
fork of Storm and would like to hear your opinions, suggestions and feedback to 
improve it. Thanks!

> Gathering worker and supervisor process information (CPU/Memory)
> ----------------------------------------------------------------
>
>                 Key: STORM-919
>                 URL: https://issues.apache.org/jira/browse/STORM-919
>             Project: Apache Storm
>          Issue Type: New Feature
>            Reporter: Shyam Rajendran
>            Assignee: Shyam Rajendran
>            Priority: Minor
>
> It would be useful to have supervisor and worker process related information 
> such as %cpu utilization, JVM memory and network bandwidth available to 
> NIMBUS which would be useful for resource aware scheduler implementation 
> later on. As a beginning, the information can be piggybacked on the existing 
> heartbeats into the ZK or to the pacemaker as required. 
> Related JIRAs
> STORM-177
> STORM-891
> STORM-899



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
