[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

Dongwon Kim (JIRA) Thu, 11 Feb 2016 22:30:29 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144128#comment-15144128
 ]


Dongwon Kim edited comment on FLINK-1502 at 2/12/16 6:29 AM:
-------------------------------------------------------------

Before deciding on the design, we should consider an environment in which a 
user launches multiple TaskManager instances on a single machine (as in my 
local development environment), while Ganglia is usually set up to run a 
single monitoring daemon on each machine. This could become a common case 
once Flink is capable of dynamic runtime scaling under YARN or Mesos (Spark 
already supports dynamic runtime scaling by running multiple smaller 
executors per node and killing some of them when underloaded). 

The problem in such an environment is that, if each of two TaskManagers 
running on a cluster node reports its metrics to Ganglia as if it were the 
only Flink daemon on that node, Ganglia shows the two different metrics in a 
single graph without aggregating them. In my experience, the resulting graph 
can be sawtooth-shaped. A workaround could distinguish the metrics of the two 
TaskManagers by appending the TaskManager ID to the name of each metric when 
reporting to Ganglia. The workaround, however, will generate too many Ganglia 
metrics (and RRD files, one per Ganglia metric), because each TaskManager is 
given a randomly generated ID whenever it is newly launched.
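
To make the growth concrete, here is a minimal sketch (not Flink code) that 
counts how many distinct metric names one host accumulates over repeated 
TaskManager launches. The class name, the method name, and the use of 
java.util.UUID as a stand-in for the randomly generated TaskManager ID are 
all assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration only; none of these names exist in Flink.
class MetricNameGrowth {

    /**
     * Counts the distinct metric names seen on one host after the given
     * number of TaskManager launches.
     */
    static int distinctNames(int launches, String[] metrics, boolean appendTaskManagerId) {
        Set<String> names = new HashSet<>();
        for (int i = 0; i < launches; i++) {
            // Assumption: a fresh random ID per launch, modeled here with a UUID.
            String id = UUID.randomUUID().toString();
            for (String metric : metrics) {
                names.add(appendTaskManagerId ? metric + "." + id : metric);
            }
        }
        return names.size();
    }

    public static void main(String[] args) {
        String[] metrics = {"heapUsed", "cpuLoad"};
        System.out.println("with IDs:    " + distinctNames(10, metrics, true));
        System.out.println("without IDs: " + distinctNames(10, metrics, false));
    }
}
```

With ID-suffixed names the count grows with every launch, and each name 
would get its own RRD file; without the suffix it stays at the number of 
metrics.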

That being said, I propose an initial plan as follows:
- The JobManager takes responsibility for reporting the TaskManagers' metrics 
to Ganglia/Graphite. Note that TaskManagers already send metrics to the 
JobManager through heartbeat messages. 
- I want the JobManager to aggregate metrics from TaskManagers running on the 
same node. I'm not sure whether this decision is good enough, because 
different TaskManagers running on the same node could exhibit different 
runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers 
running on a cluster node, the JobManager reports the aggregated value to 
Ganglia under the node's hostname. 
- As a result, Ganglia ends up with a single Ganglia metric per host instead 
of one per TaskManager.
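
The aggregation step in the plan above could look roughly like the sketch 
below. It is not taken from Flink: the Sample class, its fields, and 
aggregateByHost are hypothetical names, and summing is only one possible 
aggregation (an average or maximum may suit some metrics better).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of per-host aggregation on the JobManager side.
class HostMetricAggregator {

    /** One metric value reported by one TaskManager (illustrative fields). */
    static final class Sample {
        final String hostname;
        final String taskManagerId;
        final String metricName;
        final double value;

        Sample(String hostname, String taskManagerId, String metricName, double value) {
            this.hostname = hostname;
            this.taskManagerId = taskManagerId;
            this.metricName = metricName;
            this.value = value;
        }
    }

    /**
     * Sums values per "hostname.metricName" key, deliberately dropping the
     * TaskManager ID so each host yields one value per metric.
     */
    static Map<String, Double> aggregateByHost(List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.hostname + "." + s.metricName, s.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample("node1", "tm-a", "heapUsedMb", 512.0));
        samples.add(new Sample("node1", "tm-b", "heapUsedMb", 256.0));
        samples.add(new Sample("node2", "tm-c", "heapUsedMb", 128.0));
        for (Map.Entry<String, Double> e : aggregateByHost(samples).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Because the key contains only the hostname and the metric name, restarting a 
TaskManager with a new ID does not create a new metric. The cost is the one 
noted in the plan: differences between TaskManagers on the same node are no 
longer visible.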


> Expose metrics to graphite, ganglia and JMX.
> --------------------------------------------
>
>                 Key: FLINK-1502
>                 URL: https://issues.apache.org/jira/browse/FLINK-1502
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>            Assignee: Dongwon Kim
>            Priority: Minor
>             Fix For: pre-apache
>
>
> The metrics library makes it easy to expose collected metrics to other 
> systems such as Graphite, Ganglia, or Java's JMX (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
