[
https://issues.apache.org/jira/browse/HDFS-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965255#comment-13965255
]
Ding Yuan commented on HDFS-6145:
---------------------------------
Ping. Is there anything else I can help with from my side?
> Stopping unexpected exception from propagating to avoid serious consequences
> ----------------------------------------------------------------------------
>
> Key: HDFS-6145
> URL: https://issues.apache.org/jira/browse/HDFS-6145
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.2.0
> Reporter: Ding Yuan
>
> There are a few cases where an exception should never occur, yet the code
> simply logs it and lets execution continue. Since these exceptions indicate a
> bug rather than a recoverable condition, a safer approach may be to fail fast
> and terminate, rather than let the bad state propagate and cause unexpected
> consequences.
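> To make "terminate the execution" concrete, below is a minimal, self-contained
> sketch of the fail-fast alternative. It assumes org.apache.hadoop.util.ExitUtil
> is an acceptable way to abort the daemon; the class name FailFastSketch and the
> checkInvariant() helper are purely illustrative and not part of HDFS.
> {noformat}
> import org.apache.hadoop.util.ExitUtil;
>
> public class FailFastSketch {
>   // Stand-in for any operation whose failure indicates a bug.
>   static void checkInvariant(boolean ok) {
>     if (!ok) {
>       throw new IllegalStateException("invariant violated");
>     }
>   }
>
>   public static void main(String[] args) {
>     try {
>       checkInvariant(true);
>     } catch (IllegalStateException e) {
>       // Instead of logging and continuing with possibly corrupted state,
>       // surface the bug immediately by terminating the process.
>       ExitUtil.terminate(1, "BUG: " + e.getMessage());
>     }
>   }
> }
> {noformat}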
> ==========================
> Case 1:
> Line: 336, File:
> "org/apache/hadoop/hdfs/server/namenode/snapshot/INodeDirectorySnapshottable.java"
> {noformat}
> 325:     try {
> 326:       Quota.Counts counts = cleanSubtree(snapshot, prior, collectedBlocks,
> 327:           removedINodes, true);
> 328:       INodeDirectory parent = getParent();
> .. ..
> 335:     } catch(QuotaExceededException e) {
> 336:       LOG.error("BUG: removeSnapshot increases namespace usage.", e);
> 337:     }
> {noformat}
> Since this exception should only occur in the presence of a bug, should the
> NN simply stop execution here to keep the bad state from propagating? A
> hedged sketch of such a change follows the list of similar locations below.
> Similar handling of QuotaExceededException can be found at:
> Line: 544, File:
> "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> Line: 657, File:
> "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> Line: 669, File:
> "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> ==========================================
> ==========================
> Case 2:
> Line: 601, File: "org/apache/hadoop/hdfs/server/namenode/JournalSet.java"
> {noformat}
> 591:   public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId,
> ..
> 595:     for (JournalAndStream j : journals) {
> ..
> 598:       try {
> 599:         allLogs.addAll(fjm.getRemoteEditLogs(fromTxId, forReading, false));
> 600:       } catch (Throwable t) {
> 601:         LOG.warn("Cannot list edit logs in " + fjm, t);
> 602:       }
> {noformat}
> An exception thrown inside this block results in some edit log files not
> being considered and not being included in the checkpoint, which may result
> in data loss.
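> One possible alternative (again only a sketch, not a patch) is to let the
> failure surface to the caller instead of silently dropping that journal's
> logs. This assumes the method can declare or wrap an IOException; if that is
> not acceptable, terminating as in the other cases would serve the same
> fail-fast purpose.
> {noformat}
> } catch (Throwable t) {
>   // Propagate instead of warn-and-skip: a manifest that silently omits edit
>   // logs can cause a later checkpoint to miss transactions.
>   throw new IOException("Cannot list edit logs in " + fjm, t);
> }
> {noformat}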
> ==========================================
> ==========================
> Case 3:
> Line: 4029, File: "org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java"
> {noformat}
> 4010:     try {
> 4011:       while (fsRunning && shouldNNRmRun) {
> 4012:         checkAvailableResources();
> 4013:         if(!nameNodeHasResourcesAvailable()) {
> 4014:           String lowResourcesMsg = "NameNode low on available disk space. ";
> 4015:           if (!isInSafeMode()) {
> 4016:             FSNamesystem.LOG.warn(lowResourcesMsg + "Entering safe mode.");
> 4017:           } else {
> 4018:             FSNamesystem.LOG.warn(lowResourcesMsg + "Already in safe mode.");
> 4019:           }
> 4020:           enterSafeMode(true);
> 4021:         }
> .. ..
> 4027:       }
> 4028:     } catch (Exception e) {
> 4029:       FSNamesystem.LOG.error("Exception in NameNodeResourceMonitor: ", e);
> 4030:     }
> {noformat}
> enterSafeMode might throw an exception. If the NameNode cannot even enter
> safe mode, should the execution simply terminate?
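> If so, the monitor's catch block could treat the failure as fatal instead of
> letting the resource monitor thread die quietly while the NameNode keeps
> running. A sketch of that idea (exit code arbitrary, ExitUtil assumed to be
> the preferred mechanism):
> {noformat}
> } catch (Exception e) {
>   // If the monitor cannot check resources or enter safe mode, low-disk
>   // conditions would go undetected; treat this as fatal for the NameNode.
>   FSNamesystem.LOG.error("Exception in NameNodeResourceMonitor: ", e);
>   ExitUtil.terminate(1, e);
> }
> {noformat}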
> ==========================================