Harsh J created HDFS-3590:
-----------------------------
Summary: Print a WARN if the edit log sync period takes more than X time units
Key: HDFS-3590
URL: https://issues.apache.org/jira/browse/HDFS-3590
Project: Hadoop HDFS
Issue Type: Improvement
Components: name-node
Reporter: Harsh J
Priority: Minor
If a logSync operation, which happens for calls such as FS#create() after the
edit has been made to the NN metadata, takes longer than X seconds (I'd say that
if it took more than a minute, there's something really wrong with the volume it
probably got stuck on), we should log a WARN naming the volume that most likely
caused it. When an NN runs with multiple NFS volumes, this helps track down
which particular volume was responsible, as there are no per-NN-dir metrics of
any kind.
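As a rough illustration of the idea (not actual NameNode code), the sketch
below times the sync of each edits directory separately and records a WARN
that names the directory when the elapsed time exceeds a threshold. The class
SlowSyncWarner, its methods, and the injected clock are all hypothetical names
made up for this example; a real change would hook into the existing edit log
sync path and use the NN's own logger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper, not part of HDFS: wraps the sync of a single edits
 * directory, measures how long it took, and records a WARN naming that
 * directory if the sync exceeded the configured threshold.
 */
public class SlowSyncWarner {
    private final long thresholdMillis;
    private final LongSupplier clockMillis; // injected so the timing is testable
    private final List<String> warnings = new ArrayList<>();

    public SlowSyncWarner(long thresholdMillis, LongSupplier clockMillis) {
        this.thresholdMillis = thresholdMillis;
        this.clockMillis = clockMillis;
    }

    /** Runs the sync for one directory and warns if it was slower than the threshold. */
    public void timedSync(String dir, Runnable sync) {
        long start = clockMillis.getAsLong();
        try {
            sync.run();
        } finally {
            // Measured even if the sync throws, since a failed sync may also be a slow one.
            long elapsed = clockMillis.getAsLong() - start;
            if (elapsed > thresholdMillis) {
                String msg = "WARN: edit log sync to " + dir + " took " + elapsed
                        + " ms (threshold " + thresholdMillis + " ms)";
                warnings.add(msg);
                System.err.println(msg); // stand-in for a real logger call
            }
        }
    }

    /** Returns a copy of the warnings recorded so far. */
    public List<String> getWarnings() {
        return new ArrayList<>(warnings);
    }
}
```

With a one-minute threshold this would be constructed as
new SlowSyncWarner(60_000L, System::currentTimeMillis), with timedSync called
once per edits directory, so that the warning identifies the slow volume
rather than only the overall sync time.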
I ran into a situation today where a hard-mounted NFS point hung for over X
minutes, but after it recovered there was no indication in the NN's logs that
such an event had happened (recovering so late caused its own slew of issues,
for which I'll file other improvement JIRAs), aside from the Sync (Journal Sync)
metric spiking as the elapsed sync time value rose. A log message would have
saved time investigating this, and possibly would also have pinpointed the bad
location more accurately.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira