Ding Yuan created HDFS-6145:
-------------------------------
Summary: Stopping unexpected exception from propagating to avoid
serious consequences
Key: HDFS-6145
URL: https://issues.apache.org/jira/browse/HDFS-6145
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 2.2.0
Reporter: Ding Yuan
In a few places, an exception that should never occur is simply logged and
execution is allowed to continue. Because such an exception signals an
unexpected state, it may be safer to terminate execution so that the error
cannot propagate into more serious consequences.
==========================
Case 1:
Line: 336, File:
"org/apache/hadoop/hdfs/server/namenode/snapshot/INodeDirectorySnapshottable.java"
{noformat}
325: try {
326: Quota.Counts counts = cleanSubtree(snapshot, prior, collectedBlocks,
327: removedINodes, true);
328: INodeDirectory parent = getParent();
.. ..
335: } catch(QuotaExceededException e) {
336: LOG.error("BUG: removeSnapshot increases namespace usage.", e);
337: }
{noformat}
Since this exception can only occur as the result of a bug, should the NN
simply stop execution to keep the inconsistency from propagating?
Similar handling of QuotaExceededException can be found at:
Line: 544, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
Line: 657, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
Line: 669, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
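To make the suggestion concrete, here is a minimal sketch of the fail-fast alternative. It is not the actual HDFS code: QuotaExceededSketchException, QuotaCheckedAction and runOrFail are hypothetical names used only for illustration, and whether to rethrow or to terminate the process is left open.

```java
// Hypothetical stand-in for Hadoop's QuotaExceededException (assumption:
// modeled here as a plain checked exception so the sketch is self-contained).
class QuotaExceededSketchException extends Exception {
  QuotaExceededSketchException(String msg) {
    super(msg);
  }
}

class FailFastSketch {
  /** Work that declares a checked exception the caller believes cannot happen. */
  interface QuotaCheckedAction {
    void run() throws QuotaExceededSketchException;
  }

  /**
   * Runs the action. If the "impossible" exception does occur, it is rethrown
   * as an unchecked IllegalStateException instead of being logged and ignored,
   * so the caller cannot continue with possibly inconsistent state.
   */
  static void runOrFail(QuotaCheckedAction action) {
    try {
      action.run();
    } catch (QuotaExceededSketchException e) {
      throw new IllegalStateException(
          "BUG: removeSnapshot increases namespace usage.", e);
    }
  }
}
```

With this shape, the original exception is preserved as the cause, and the decision about how to stop is pushed to a layer that knows whether the whole NN should go down.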
==========================================
==========================
Case 2:
Line: 601, File: "org/apache/hadoop/hdfs/server/namenode/JournalSet.java"
{noformat}
591: public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId,
..
595: for (JournalAndStream j : journals) {
..
598: try {
599: allLogs.addAll(fjm.getRemoteEditLogs(fromTxId, forReading, false));
600: } catch (Throwable t) {
601: LOG.warn("Cannot list edit logs in " + fjm, t);
602: }
{noformat}
An exception thrown inside this try block (for example from addAll) causes some
edit log files to be skipped and left out of the checkpoint, which may result in
data loss.
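One possible alternative, sketched below under stated assumptions, is to propagate the failure so that the caller never receives a silently incomplete list. LogSource and collectAll are hypothetical names, not JournalSet APIs, and failing the whole call may be too strict if other journals hold the same segments; this only illustrates the option.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ManifestSketch {
  /** Hypothetical stand-in for a journal that can list its edit log segments. */
  interface LogSource {
    List<String> listLogs(long fromTxId) throws IOException;
  }

  /**
   * Gathers logs from every source. If any source fails, the failure is
   * rethrown with context instead of being logged and skipped, so a partial
   * result is never returned as if it were complete.
   */
  static List<String> collectAll(List<LogSource> sources, long fromTxId)
      throws IOException {
    List<String> allLogs = new ArrayList<>();
    for (LogSource source : sources) {
      try {
        allLogs.addAll(source.listLogs(fromTxId));
      } catch (IOException e) {
        throw new IOException("Cannot list edit logs in " + source, e);
      }
    }
    return allLogs;
  }
}
```

Catching IOException rather than Throwable also means that unrelated errors, such as an OutOfMemoryError, are not swallowed along the way.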
==========================================
==========================
Case 3:
Line: 4029, File: "org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java"
{noformat}
4010: try {
4011: while (fsRunning && shouldNNRmRun) {
4012: checkAvailableResources();
4013: if(!nameNodeHasResourcesAvailable()) {
4014: String lowResourcesMsg = "NameNode low on available disk space. ";
4015: if (!isInSafeMode()) {
4016: FSNamesystem.LOG.warn(lowResourcesMsg + "Entering safe mode.");
4017: } else {
4018: FSNamesystem.LOG.warn(lowResourcesMsg + "Already in safe mode.");
4019: }
4020: enterSafeMode(true);
4021: }
.. ..
4027: }
4028: } catch (Exception e) {
4029: FSNamesystem.LOG.error("Exception in NameNodeResourceMonitor: ", e);
4030: }
{noformat}
enterSafeMode might throw an exception. If the NameNode cannot enter safe mode,
should execution simply terminate?
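A rough sketch of what terminating could look like follows. The Namesystem and Terminator interfaces and the checkOnce method are hypothetical and exist only to keep the example self-contained; in Hadoop the terminate hook might be backed by something like ExitUtil.terminate, but that is an assumption, not a proposal for the exact fix.

```java
class ResourceMonitorSketch {
  /** Hypothetical hooks standing in for the NameNode operations shown above. */
  interface Namesystem {
    boolean hasResourcesAvailable();

    void enterSafeMode() throws Exception;
  }

  /** Hypothetical termination hook; a real process might halt the JVM here. */
  interface Terminator {
    void terminate(String reason, Throwable cause);
  }

  /**
   * Performs one monitoring pass and returns true if monitoring should
   * continue. When resources are low and safe mode cannot be entered, the
   * terminator is invoked instead of logging the error and carrying on.
   */
  static boolean checkOnce(Namesystem ns, Terminator terminator) {
    if (ns.hasResourcesAvailable()) {
      return true;
    }
    try {
      ns.enterSafeMode();
      return true;
    } catch (Exception e) {
      terminator.terminate("Cannot enter safe mode while low on resources", e);
      return false;
    }
  }
}
```

Passing the terminator in as a parameter keeps the policy (halt, exit, or rethrow) separate from the monitor loop and makes the failure path easy to exercise in a unit test.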
==========================================