Github user rmetzger commented on the pull request:
https://github.com/apache/flink/pull/421#issuecomment-75545711
Thanks everybody for the positive feedback!
> What does the OS load mean? It would be really awesome to show the CPU
load, too. I think this is a helpful indicator.
On the OS load:
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
I totally agree that the OS load is not a very good metric for our
purposes.
The reason why I didn't try to get better metrics for this is that I didn't
want to play "ugly tricks" to get them.
My code gets the metrics only via the management beans. The standard
`OperatingSystemMXBean` exposes only the load average and the number of
processor cores:
http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html#getSystemLoadAverage()
There is also a platform-specific extension of the `OperatingSystemMXBean`
(https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html)
which additionally exposes methods such as `getProcessCpuLoad()`.
But whether this management bean is available depends on the JVM vendor and
version.
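
To make the difference between the two beans concrete, here is a minimal sketch (not the code from this PR). The class name `CpuLoadProbe` and its method names are made up for illustration. It reads the load average and core count from the standard bean, and looks up `getProcessCpuLoad()` reflectively through the `com.sun.management` interface so that it still runs on JVMs that do not ship that extension:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration; not part of the PR.
final class CpuLoadProbe {

    private static final OperatingSystemMXBean OS_BEAN =
            ManagementFactory.getOperatingSystemMXBean();

    /** System load average for the last minute; negative if not available. */
    static double systemLoadAverage() {
        return OS_BEAN.getSystemLoadAverage();
    }

    /** Number of processors available to the JVM. */
    static int availableProcessors() {
        return OS_BEAN.getAvailableProcessors();
    }

    /**
     * CPU load of this JVM process, or a negative value if the
     * com.sun.management extension is missing or reports no value.
     */
    static double processCpuLoad() {
        try {
            // Resolve the extension interface by name so this class still
            // loads on JVMs that do not provide it.
            Class<?> ext = Class.forName("com.sun.management.OperatingSystemMXBean");
            if (ext.isInstance(OS_BEAN)) {
                return (Double) ext.getMethod("getProcessCpuLoad").invoke(OS_BEAN);
            }
        } catch (ReflectiveOperationException e) {
            // Extension not present or not callable: fall through.
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("load average:     " + systemLoadAverage());
        System.out.println("processors:       " + availableProcessors());
        System.out.println("process CPU load: " + processCpuLoad());
    }
}
```

Callers have to treat a negative value from either load method as "not available" rather than as a measurement.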
Another way to get the CPU load of the process would be parsing the output
of `ps` or `top`. But that also falls into the category of "ugly tricks".
I think we should aim for getting those metrics into the system as well.
Adding them is a matter of registering another Gauge in the TaskManager's
metrics registry and visualizing the JSON output.
I hope that these kinds of refinements are done by external contributors.
Once this PR has been merged, I'll file a JIRA to improve the CPU
monitoring.
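
As a rough illustration of what "registering another Gauge" could look like, here is a self-contained sketch. The `Gauge` interface and `Registry` class below are stand-ins written for this example, modelled on the single-method gauge shape common in metrics libraries; they are not the TaskManager's actual registry, and the metric names are invented:

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-ins only; the real registry in the PR may differ.
final class GaugeSketch {

    /** A gauge reports the current value of something when asked. */
    interface Gauge<T> {
        T getValue();
    }

    /** Tiny registry that renders all numeric gauges as a flat JSON object. */
    static final class Registry {
        // TreeMap keeps the JSON keys in a stable, sorted order.
        private final Map<String, Gauge<? extends Number>> gauges = new TreeMap<>();

        void register(String name, Gauge<? extends Number> gauge) {
            if (gauges.containsKey(name)) {
                throw new IllegalArgumentException("duplicate gauge: " + name);
            }
            gauges.put(name, gauge);
        }

        /** Polls every gauge once. Names are assumed to need no JSON escaping. */
        String toJson() {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<String, Gauge<? extends Number>> e : gauges.entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                first = false;
                sb.append('"').append(e.getKey()).append("\":")
                  .append(e.getValue().getValue());
            }
            return sb.append('}').toString();
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // Hypothetical metric names.
        registry.register("cpu.cores",
                () -> ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors());
        registry.register("cpu.load",
                () -> ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        System.out.println(registry.toJson());
    }
}
```

A process CPU load gauge would be one more `register` call; the visualization side then only needs to pick up the new key from the JSON.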
> What are the current options for showing the detailed metrics? I see a
"show 3 TMs" and "show all TMs" button in the screenshot? Can you select which
three to show?
No, you cannot choose which three TMs.
I added these buttons because starting a large Flink cluster (50+ nodes)
puts quite some load on the browser, which has to update all the charts.
Usually it's sufficient to monitor the load of only a few TMs, because
(ideally) they are all doing mostly the same work.
But I agree that there is room for improvement.
> How about we open a document and sketch the design of the monitoring and
create smaller PRs to get there step-by-step.
I totally agree that we should do small incremental improvements.
As I said in the PR description, the primary purpose of this PR is to get
the basic monitoring infrastructure in place. How we present the data in the
end is subject to further PRs.
I have started working on the "per-job" monitoring and found that I have to
change some details of this PR as well.
Depending on my progress on the "per-job" monitoring, I might contribute the
changes here together with the "per-job" metrics. If I don't have enough time
this week to open a PR for the "per-job" metrics, I'll merge this change to
master.