[
https://issues.apache.org/jira/browse/HDFS-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aaron T. Myers updated HDFS-5931:
---------------------------------
Description:
This is to report some improvements and potential bug fixes to some error
handling code. Also attaching a patch for review.
Details in the first comment.
was:
Hi,
This is to report some improvements and potential bug fixes to some error
handling code. Also attaching a patch for review.
==========================
Case 1:
org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
In recoverUnclosedStreams, purgeLogsOlderThan, and endCurrentLogSegment,
when not enough journals are found, mapJournalsAndReportErrors throws
IOException.
While in some other cases (such as in logEdit()), this case would be handled by
a
later call to logSync(), which will terminate, but in these three methods these
exceptions
might be swallowed since logSync() won't be called in anytime soon.
Propose to immediately handle such cases as in logSync().
{noformat}
synchronized void recoverUnclosedStreams() {
Preconditions.checkState(
state == State.BETWEEN_LOG_SEGMENTS,
"May not recover segments - wrong state: %s", state);
try {
journalSet.recoverUnfinalizedSegments();
} catch (IOException ex) {
- // All journals have failed, it is handled in logSync.
- // TODO: are we sure this is OK?
+ //All journals have failed, handle it here.
+ final String msg =
+ "Could not find enough journals. "
+ LOG.fatal(msg, new Exception());
+ IOUtils.cleanup(LOG, journalSet);
+ terminate(1, msg);
}
}
{noformat}
==========================================
==========================
Case 2:
File: "org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java"
{noformat}
@@ -446,6 +446,8 @@
deleteDir(tmpDir);
} catch (IOException ex) {
LOG.error("Finalize upgrade for " + dataDirPath + " failed.", ex);
+ return; // return so users will not see confusing log messages where
+ // "failed" is immediately followed by "complete"
}
LOG.info("Finalize upgrade for " + dataDirPath + " is complete.");
}
{noformat}
In this case, the log can be confusing if deleteDir failed:
"Finalize upgrade failed" message will be immediately followed by "Finalize
upgrade is complete".
Same pattern can be seen at:
Line: 600, File: "org/apache/hadoop/hdfs/server/datanode/DataStorage.java"
Line: 1250, File:
"org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java"
==========================================
==========================
Case 3:
File: "org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java"
The log message doesn't include the exception cause.
{noformat}
@@ -167,9 +167,9 @@
LOG.debug(this + " received versionRequest response: " + nsInfo);
break;
} catch(SocketTimeoutException e) { // namenode is busy
- LOG.warn("Problem connecting to server: " + nnAddr);
+ LOG.warn("Problem connecting to server: " + nnAddr, e);
} catch(IOException e ) { // namenode is not available
- LOG.warn("Problem connecting to server: " + nnAddr);
+ LOG.warn("Problem connecting to server: " + nnAddr, e);
}
{noformat}
==========================================
==========================
Case 4:
File: "org/apache/hadoop/hdfs/server/datanode/BlockSender.java"
{noformat}
@@ -371,6 +371,7 @@
IOException ioe = null;
if (blockInFd != null &&
((dropCacheBehindAllReads) ||
(dropCacheBehindLargeReads && isLongRead()))) {
try {
NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
block.getBlockName(), blockInFd, lastCacheDropOffset,
offset - lastCacheDropOffset,
NativeIO.POSIX.POSIX_FADV_DONTNEED);
} catch (IOException e) {
LOG.warn("Unable to drop cache on file close", e);
+ ioe = e;
}
}
if(checksumIn!=null) {
try {
checksumIn.close(); // close checksum file
} catch (IOException e) {
ioe = e;
}
checksumIn = null;
.. ..
if (ioe!=null) {
throw ioe;
}
{noformat}
It's unclear why exception from posixFadviceIfPossible is not returned to the
caller.
==========================================
==========================
Case 5:
File: org/apache/hadoop/hdfs/server/datanode/DataXceiver.java
{noformat}
} finally {
dataXceiverServer.balanceThrottler.release();
if (isOpSuccess) {
@@ -751,7 +751,8 @@
try {
// send one last byte to indicate that the resource is cleaned.
reply.writeChar('d');
- } catch (IOException ignored) {
+ } catch (IOException e) {
+ LOG.warn("Exception caught after copy block succeeded " + e);
}
}
IOUtils.closeStream(reply);
{noformat}
It is unclear why the IOException is not logged.
Similar case:
Line: 876, File: "org/apache/hadoop/hdfs/server/datanode/DataXceiver.java"
==========================================
Hi Ding, thanks a lot for the report. Usually we try to keep the description of
a JIRA short since it's reproduced in every email update about the JIRA. You
are encouraged to put detailed descriptions in the first comment of the JIRA.
I'm reproducing them here for you:
Description of issues from Ding Yuan:
==========================
Case 1:
org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
In recoverUnclosedStreams, purgeLogsOlderThan, and endCurrentLogSegment,
when not enough journals are found, mapJournalsAndReportErrors throws
IOException.
While in some other cases (such as in logEdit()), this case would be handled by
a
later call to logSync(), which will terminate, but in these three methods these
exceptions
might be swallowed since logSync() won't be called in anytime soon.
Propose to immediately handle such cases as in logSync().
{noformat}
synchronized void recoverUnclosedStreams() {
Preconditions.checkState(
state == State.BETWEEN_LOG_SEGMENTS,
"May not recover segments - wrong state: %s", state);
try {
journalSet.recoverUnfinalizedSegments();
} catch (IOException ex) {
- // All journals have failed, it is handled in logSync.
- // TODO: are we sure this is OK?
+ //All journals have failed, handle it here.
+ final String msg =
+ "Could not find enough journals. "
+ LOG.fatal(msg, new Exception());
+ IOUtils.cleanup(LOG, journalSet);
+ terminate(1, msg);
}
}
{noformat}
==========================================
==========================
Case 2:
File: "org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java"
{noformat}
@@ -446,6 +446,8 @@
deleteDir(tmpDir);
} catch (IOException ex) {
LOG.error("Finalize upgrade for " + dataDirPath + " failed.", ex);
+ return; // return so users will not see confusing log messages where
+ // "failed" is immediately followed by "complete"
}
LOG.info("Finalize upgrade for " + dataDirPath + " is complete.");
}
{noformat}
In this case, the log can be confusing if deleteDir failed:
"Finalize upgrade failed" message will be immediately followed by "Finalize
upgrade is complete".
Same pattern can be seen at:
Line: 600, File: "org/apache/hadoop/hdfs/server/datanode/DataStorage.java"
Line: 1250, File:
"org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java"
==========================================
==========================
Case 3:
File: "org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java"
The log message doesn't include the exception cause.
{noformat}
@@ -167,9 +167,9 @@
LOG.debug(this + " received versionRequest response: " + nsInfo);
break;
} catch(SocketTimeoutException e) { // namenode is busy
- LOG.warn("Problem connecting to server: " + nnAddr);
+ LOG.warn("Problem connecting to server: " + nnAddr, e);
} catch(IOException e ) { // namenode is not available
- LOG.warn("Problem connecting to server: " + nnAddr);
+ LOG.warn("Problem connecting to server: " + nnAddr, e);
}
{noformat}
==========================================
==========================
Case 4:
File: "org/apache/hadoop/hdfs/server/datanode/BlockSender.java"
{noformat}
@@ -371,6 +371,7 @@
IOException ioe = null;
if (blockInFd != null &&
((dropCacheBehindAllReads) ||
(dropCacheBehindLargeReads && isLongRead()))) {
try {
NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
block.getBlockName(), blockInFd, lastCacheDropOffset,
offset - lastCacheDropOffset,
NativeIO.POSIX.POSIX_FADV_DONTNEED);
} catch (IOException e) {
LOG.warn("Unable to drop cache on file close", e);
+ ioe = e;
}
}
if(checksumIn!=null) {
try {
checksumIn.close(); // close checksum file
} catch (IOException e) {
ioe = e;
}
checksumIn = null;
.. ..
if (ioe!=null) {
throw ioe;
}
{noformat}
It's unclear why exception from posixFadviceIfPossible is not returned to the
caller.
==========================================
==========================
Case 5:
File: org/apache/hadoop/hdfs/server/datanode/DataXceiver.java
{noformat}
} finally {
dataXceiverServer.balanceThrottler.release();
if (isOpSuccess) {
@@ -751,7 +751,8 @@
try {
// send one last byte to indicate that the resource is cleaned.
reply.writeChar('d');
- } catch (IOException ignored) {
+ } catch (IOException e) {
+ LOG.warn("Exception caught after copy block succeeded " + e);
}
}
IOUtils.closeStream(reply);
{noformat}
It is unclear why the IOException is not logged.
Similar case:
Line: 876, File: "org/apache/hadoop/hdfs/server/datanode/DataXceiver.java"
==========================================
> Potential bugs and improvements for exception handlers
> ------------------------------------------------------
>
> Key: HDFS-5931
> URL: https://issues.apache.org/jira/browse/HDFS-5931
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.2.0
> Reporter: Ding Yuan
> Attachments: hdfs-5931.patch
>
>
> This is to report some improvements and potential bug fixes to some error
> handling code. Also attaching a patch for review.
> Details in the first comment.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)