[
https://issues.apache.org/jira/browse/HDFS-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ding Yuan updated HDFS-5931:
----------------------------
Attachment: hdfs-5931.patch
> Potential bugs and improvements for exception handlers
> ------------------------------------------------------
>
> Key: HDFS-5931
> URL: https://issues.apache.org/jira/browse/HDFS-5931
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.2.0
> Reporter: Ding Yuan
> Attachments: hdfs-5931.patch
>
>
> Hi,
> This is to report some improvements and potential bug fixes to some error
> handling code. Also attaching a patch for review.
> ==========================
> Case 1:
> org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
> In recoverUnclosedStreams, purgeLogsOlderThan, and endCurrentLogSegment,
> when not enough journals are found, mapJournalsAndReportErrors throws
> IOException.
> While in some other cases (such as in logEdit()), this case would be handled
> by a
> later call to logSync(), which will terminate, but in these three methods
> these exceptions
> might be swallowed since logSync() won't be called in anytime soon.
> Propose to immediately handle such cases as in logSync().
> {noformat}
> synchronized void recoverUnclosedStreams() {
> Preconditions.checkState(
> state == State.BETWEEN_LOG_SEGMENTS,
> "May not recover segments - wrong state: %s", state);
> try {
> journalSet.recoverUnfinalizedSegments();
> } catch (IOException ex) {
> - // All journals have failed, it is handled in logSync.
> - // TODO: are we sure this is OK?
> + //All journals have failed, handle it here.
> + final String msg =
> + "Could not find enough journals. "
> + LOG.fatal(msg, new Exception());
> + IOUtils.cleanup(LOG, journalSet);
> + terminate(1, msg);
> }
> }
> {noformat}
> ==========================================
> ==========================
> Case 2:
> File: "org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java"
> {noformat}
> @@ -446,6 +446,8 @@
> deleteDir(tmpDir);
> } catch (IOException ex) {
> LOG.error("Finalize upgrade for " + dataDirPath + " failed.", ex);
> + return; // return so users will not see confusing log messages
> where
> + // "failed" is immediately followed by "complete"
> }
> LOG.info("Finalize upgrade for " + dataDirPath + " is complete.");
> }
> {noformat}
> In this case, the log can be confusing if deleteDir failed:
> "Finalize upgrade failed" message will be immediately followed by "Finalize
> upgrade is complete".
> Same pattern can be seen at:
> Line: 600, File: "org/apache/hadoop/hdfs/server/datanode/DataStorage.java"
> Line: 1250, File:
> "org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java"
> ==========================================
> ==========================
> Case 3:
> File: "org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java"
> The log message doesn't include the exception cause.
> {noformat}
> @@ -167,9 +167,9 @@
> LOG.debug(this + " received versionRequest response: " + nsInfo);
> break;
> } catch(SocketTimeoutException e) { // namenode is busy
> - LOG.warn("Problem connecting to server: " + nnAddr);
> + LOG.warn("Problem connecting to server: " + nnAddr, e);
> } catch(IOException e ) { // namenode is not available
> - LOG.warn("Problem connecting to server: " + nnAddr);
> + LOG.warn("Problem connecting to server: " + nnAddr, e);
> }
> {noformat}
> ==========================================
> ==========================
> Case 4:
> File: "org/apache/hadoop/hdfs/server/datanode/BlockSender.java"
> {noformat}
> @@ -371,6 +371,7 @@
> IOException ioe = null;
> if (blockInFd != null &&
> ((dropCacheBehindAllReads) ||
> (dropCacheBehindLargeReads && isLongRead()))) {
> try {
> NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
> block.getBlockName(), blockInFd, lastCacheDropOffset,
> offset - lastCacheDropOffset,
> NativeIO.POSIX.POSIX_FADV_DONTNEED);
> } catch (IOException e) {
> LOG.warn("Unable to drop cache on file close", e);
> + ioe = e;
> }
> }
> if(checksumIn!=null) {
> try {
> checksumIn.close(); // close checksum file
> } catch (IOException e) {
> ioe = e;
> }
> checksumIn = null;
> .. ..
> if (ioe!=null) {
> throw ioe;
> }
> {noformat}
> It's unclear why exception from posixFadviceIfPossible is not returned to the
> caller.
> ==========================================
> ==========================
> Case 5:
> File: org/apache/hadoop/hdfs/server/datanode/DataXceiver.java
> {noformat}
> } finally {
> dataXceiverServer.balanceThrottler.release();
> if (isOpSuccess) {
> @@ -751,7 +751,8 @@
> try {
> // send one last byte to indicate that the resource is cleaned.
> reply.writeChar('d');
> - } catch (IOException ignored) {
> + } catch (IOException e) {
> + LOG.warn("Exception caught after copy block succeeded " + e);
> }
> }
> IOUtils.closeStream(reply);
> {noformat}
> It is unclear why the IOException is not logged.
> Similar case:
> Line: 876, File: "org/apache/hadoop/hdfs/server/datanode/DataXceiver.java"
> ==========================================
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)