[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144128#comment-15144128 ]
Dongwon Kim edited comment on FLINK-1502 at 2/12/16 6:29 AM:
-------------------------------------------------------------

Before deciding on the design, we should consider an environment in which a user launches multiple TaskManager instances on a single machine (this is my local development environment), while Ganglia is usually set up to run a single monitoring daemon on each machine. This could become a common case once Flink is capable of dynamic runtime scaling under YARN or Mesos (Spark already supports dynamic runtime scaling by executing multiple smaller executors per node and killing some of them when underloaded).

The problem in such an environment is that, if each of two TaskManagers running on a cluster node reports its metrics to Ganglia as if it were the only Flink daemon running on the node, Ganglia shows two different metrics in a single graph without aggregating them. In my experience, the graph can be sawtooth-shaped.

A workaround could distinguish metrics from the two TaskManagers by appending TaskManager IDs to the name of each metric when reporting to Ganglia. The workaround, however, will generate too many Ganglia metrics (and RRD files, one per Ganglia metric) because each TaskManager is given a randomly generated ID whenever it is launched.

That being said, I propose an initial plan as follows:
- JobManager takes responsibility for reporting TaskManagers' metrics to Ganglia/Graphite. Note that TaskManagers already send metrics to JobManager through heartbeat messages.
- I want JobManager to aggregate metrics from TaskManagers running on the same node. I'm not sure whether this decision is good enough, because different TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers running on a cluster node, JobManager reports the aggregated value of the metric to Ganglia with the hostname.
- By doing that, Ganglia will end up with a single Ganglia metric.

was (Author: eastcirclek):
Before deciding on the design, we should consider an environment in which a user launches multiple TaskManager instances on a single machine (this is my local development environment), while Ganglia is usually set up to run a single monitoring daemon on each machine. This could become a common case once Flink is capable of dynamic runtime scaling under YARN or Mesos (Spark already supports dynamic runtime scaling by executing multiple smaller executors per node and killing some of them when underloaded).

The problem in such an environment is that, if each of two TaskManagers running on a cluster node reports its metrics to Ganglia as if it were the only Flink daemon running on the node, Ganglia shows two different metrics in a single graph without aggregating them. In my experience, the graph can be sawtooth-shaped.

A workaround could distinguish metrics from the two TaskManagers by appending TaskManager IDs to the name of each metric when reporting to Ganglia. The workaround, however, will generate too many Ganglia metrics (and RRD files, one per Ganglia metric) in the Ganglia master node because each TaskManager is given a randomly generated ID whenever it is launched.

That being said, I propose an initial plan as follows:
- JobManager takes responsibility for reporting TaskManagers' metrics to Ganglia/Graphite. Note that TaskManagers already send metrics to JobManager through heartbeat messages.
- I want JobManager to aggregate metrics from TaskManagers running on the same node. I'm not sure whether this decision is good enough, because different TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers running on a cluster node, JobManager reports the aggregated value of the metric to Ganglia with the hostname.
- By doing that, Ganglia will end up with a single Ganglia metric.

> Expose metrics to graphite, ganglia and JMX.
> --------------------------------------------
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager, TaskManager
> Affects Versions: 0.9
> Reporter: Robert Metzger
> Assignee: Dongwon Kim
> Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other
> systems such as graphite, ganglia or Java's JVM (VisualVM).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
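
The per-host aggregation proposed in the comment above can be sketched roughly as follows. This is only an illustration, not Flink code: the function name `aggregate_by_host`, the report tuple layout, and the metric names are all hypothetical, and summing is just one possible aggregation (the comment does not say which one to use; an average or maximum may suit some gauges better).

```python
from collections import defaultdict


def aggregate_by_host(reports):
    """Combine metric reports from several TaskManagers into one set per host.

    `reports` is an iterable of (hostname, taskmanager_id, metrics) tuples,
    where `metrics` maps a metric name to a numeric value. This layout is an
    assumption made for the sketch, not the actual heartbeat payload.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for hostname, _taskmanager_id, metrics in reports:
        for name, value in metrics.items():
            # Assumption: values from TaskManagers on the same host are summed.
            totals[hostname][name] += value
    # Convert nested defaultdicts to plain dicts for the caller.
    return {host: dict(per_metric) for host, per_metric in totals.items()}


# Hypothetical reports: two TaskManagers on node1, one on node2.
reports = [
    ("node1", "tm-a", {"heap_used_mb": 512.0, "records_in": 1000.0}),
    ("node1", "tm-b", {"heap_used_mb": 256.0, "records_in": 400.0}),
    ("node2", "tm-c", {"heap_used_mb": 128.0, "records_in": 50.0}),
]
per_host = aggregate_by_host(reports)
```

Because the randomly generated TaskManager ID is dropped during aggregation, the number of distinct metric names reported per host stays fixed no matter how often TaskManagers are restarted.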