[
https://issues.apache.org/jira/browse/HDFS-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902048#comment-13902048
]
Todd Lipcon commented on HDFS-5951:
-----------------------------------
I agree with Aaron. I can think of several good reasons against self-monitoring
systems:
- A self-monitoring system cannot check things like external network
connectivity. For example, if a NN sees that it is getting 0 requests/sec, that
may indicate that the network is down, or it may just indicate that there are
no clients. An external system can provide much better data by actually
checking that the NN is accessible and functioning correctly (e.g. a canary).
- Similarly, if the RPC subsystem is dead, we can't tell that internally - we
need something like an external canary to tell us.
- In my experience, a large majority of issues we see in HDFS are due to some
environmental issues -- for example frame errors on the NIC, machine swapping,
underprovisioned network resources, failing HDs, etc. These are obviously
out-of-scope for the NN to monitor, right? Given that any competent operator
needs to monitor all of the above, do they really gain a lot by also having a
web UI notice?
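
To illustrate the external canary idea from the list above, here is a minimal
Python sketch. It only checks that a TCP connection to the service can be
opened from outside the process; a real canary would go further and perform an
actual operation (for example, a small write and read). The function name
`canary_check` and its parameters are hypothetical, not part of HDFS.

```python
import socket


def canary_check(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    This is only a reachability probe run from OUTSIDE the monitored process.
    It does not prove the service is answering requests correctly.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Because the probe runs on a different machine than the NN, a failure here can
distinguish "network is down" from "there are simply no clients", which the NN
cannot do by looking at its own request rate.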
Additionally, a useful monitoring system has a lot more than a simple notice on
a web page. For example:
- SNMP traps to notify external systems of issues (e.g. bubbling up to a
corporate NOC)
- Email or other alerts for issues.
- Configurable thresholds for metrics-based checks
- Historical information available to triggers (e.g. "metric X is above value Y
for at least Z minutes in a row")
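
As a rough illustration of the last item, a trigger of the form "metric X is
above value Y for at least Z minutes in a row" could look like the Python
sketch below in an external monitoring system. The function name
`sustained_breach` and the `(timestamp, value)` sample format are assumptions
made for this example only.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if the metric has stayed above `threshold` for at least
    `min_duration_s` seconds, ending at the most recent sample.

    `samples` is an iterable of (timestamp_seconds, value) pairs.
    """
    breach_start = None  # timestamp where the current run above threshold began
    latest = None        # timestamp of the most recent sample seen
    for ts, value in sorted(samples):
        latest = ts
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the run.
            breach_start = None
    if breach_start is None:
        return False
    return latest - breach_start >= min_duration_s
```

Note that this requires stored history, which is exactly the kind of state a
dedicated monitoring system keeps and a simple web UI notice does not.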
I think we'll all agree that the above are out of scope for a system like HDFS.
Instead, HDFS should make sure that all interesting data is exposed as metrics,
and that the metrics are documented (perhaps with some advice on thresholds).
Additionally, the community might make available a set of scripts to poll the
metrics that could be hooked into external systems like Nagios.
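
A polling script of the kind suggested above might be sketched in Python as
follows. It assumes the metrics are available as a JSON document with a
top-level "beans" list (the shape served by Hadoop's /jmx endpoint, which
should be verified against the deployed version) and it follows the Nagios
plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The
function names, the bean name, and the attribute name in the usage note are
hypothetical.

```python
import json

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def extract_metric(jmx_json, bean_name, attribute):
    """Return the attribute value from the named bean, or None if absent."""
    doc = json.loads(jmx_json)
    for bean in doc.get("beans", []):
        if bean.get("name") == bean_name and attribute in bean:
            return bean[attribute]
    return None


def check_metric(value, warn, crit):
    """Map a metric value to a (exit_code, message) pair, Nagios style."""
    if value is None:
        return UNKNOWN, "UNKNOWN - metric not found"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"
```

A wrapper script would fetch the JSON over HTTP, call
`extract_metric(body, "ExampleBean", "ExampleMetric")`, print the message from
`check_metric`, and exit with the returned code. The thresholds stay in the
monitoring system's configuration rather than in HDFS.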
> Provide diagnosis information in the Web UI
> -------------------------------------------
>
> Key: HDFS-5951
> URL: https://issues.apache.org/jira/browse/HDFS-5951
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Haohui Mai
> Assignee: Haohui Mai
> Attachments: HDFS-5951.000.patch, diagnosis-failure.png,
> diagnosis-succeed.png
>
>
> HDFS should provide operation statistics in its UI. It can go one step
> further by leveraging the information to diagnose common problems.