[ https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158090#comment-15158090 ]
Jamie Grier commented on FLINK-1502:
------------------------------------

[~eastcirclek] Let's define our terms to make sure we're talking about the same thing.

*Session*: A single instance of a JobManager and some number of TaskManagers working together. A session can be created "on-the-fly" for a single job or it can be long-running. Multiple jobs can start, run, and finish in the same session. Think of the "yarn-session.sh" command: it creates a session outside of any particular job. This is also what I've meant when I've said "cluster". A YARN session is a "cluster" that we've spun up for some length of time on YARN. Another example of a cluster would be a standalone install of Flink on some number of machines.

*Job*: A single batch or streaming job that runs on a Flink cluster.

In the above scenario, if your definition of a session agrees with mine, you would instead have the following. Note that I've named the cluster according to the "session" name you've given, because in this case each session is really a different (ad-hoc) cluster. When you run a job directly using just "flink run -ytm ..." on YARN you are spinning up an ad-hoc cluster for your job.

After Session 1 is finished, Node 1 would have the following metrics:
- cluster.session1.taskmanager.1.gc_time

After Session 2 is finished, Node 1 would have the following metrics:
- cluster.session1.taskmanager.1.gc_time
- cluster.session2.taskmanager.2.gc_time
- cluster.session3.taskmanager.3.gc_time

There are many metrics in this case because that's exactly what you want. These are JVM-scope metrics, and those are 3 different JVMs, not the same one, so it makes total sense for them to have different names/scopes. These metrics have nothing to do with each other, and it doesn't matter which host they come from. They are scoped to the cluster (or session) and the logical TaskManager index, not the host.

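As an illustration of the scoping described above, here is a minimal sketch in Java. The class name MetricScopes and the helper jvmMetricName are hypothetical and not part of Flink's API; the sketch only assumes that a JVM-scope metric name is built from the cluster (session) name, the logical TaskManager index, and the metric name, with the hostname deliberately left out.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class MetricScopes {
    // Hypothetical helper (not a Flink API): builds a JVM-scope metric name from
    // the cluster (session) name and the logical TaskManager index.
    // The hostname is intentionally not a parameter.
    static String jvmMetricName(String cluster, int taskManagerIndex, String metric) {
        return "cluster." + cluster + ".taskmanager." + taskManagerIndex + "." + metric;
    }

    public static void main(String[] args) {
        // Three differently named ad-hoc clusters (sessions), as in the scenario above.
        Set<String> names = new LinkedHashSet<>();
        names.add(jvmMetricName("session1", 1, "gc_time"));
        names.add(jvmMetricName("session2", 2, "gc_time"));
        names.add(jvmMetricName("session3", 3, "gc_time"));

        // Each differently named session contributes its own metric name.
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

Under these assumptions, every new session name adds a new entry to the set, while calling the helper again with a session name and index already seen adds nothing.
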
The above should not be confused with any host-level metrics we want to report. Host-level metrics would be scoped simply by the hostname, so they wouldn't grow either.

One more example, hopefully to clarify. Let's say I spun up a long-running cluster (or session) using yarn-session.sh -tm 3. Now we have a Flink cluster running on YARN with no jobs running and three TaskManagers. We then run three different jobs one after another on this cluster. The metrics would still simply be:
- cluster.yarn-session.taskmanager.1.gc_time
- cluster.yarn-session.taskmanager.2.gc_time
- cluster.yarn-session.taskmanager.3.gc_time

No matter how many jobs you ran, this list would not grow, which is natural because there have only ever been 3 TaskManagers. If one of these TaskManagers were to fail and be restarted, it would assume the same name -- that's the point of using "logical" indexes -- so the set of metric names in that case still would not be larger than the above.

In the initial case you describe above, if you didn't want lots of different metrics over time, you could also just give all of your sessions the same name. Your metrics are growing because you're spinning up many different clusters (sessions) over time with a different name each time. If you used the same name for the cluster (session) every time, this growth of the metrics namespace would not occur.

I hope any of that made sense ;) This is getting a bit hard to describe this way. We could also sync via Hangouts or something if that is easier.


> Expose metrics to graphite, ganglia and JMX.
> --------------------------------------------
>
>                 Key: FLINK-1502
>                 URL: https://issues.apache.org/jira/browse/FLINK-1502
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Robert Metzger
>            Assignee: Dongwon Kim
>            Priority: Minor
>             Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)