[ https://issues.apache.org/jira/browse/HDFS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178418#comment-15178418 ]
Hadoop QA commented on HDFS-9874:
---------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 55s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 53s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 42s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 10s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 56s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 57s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_72. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 57m 28s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 151m 20s {color} | {color:black} {color} |

|| Reason || Tests ||
| JDK v1.8.0_72 Failed junit tests | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
| | hadoop.hdfs.shortcircuit.TestShortCircuitCache |
| JDK v1.7.0_95 Failed junit tests | hadoop.hdfs.server.namenode.TestEditLog |
| | hadoop.hdfs.server.datanode.TestFsDatasetCache |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12791200/HDFS-9874-trunk.patch |
| JIRA Issue | HDFS-9874 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
| uname | Linux bf00281a78f7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0a9f00a |
| Default Java | 1.7.0_95 |
| Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 |
| findbugs | v3.0.0 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt |
| JDK v1.7.0_95 Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/14705/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/14705/console |
| Powered by | Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> Long living DataXceiver threads cause volume shutdown to block.
> ---------------------------------------------------------------
>
>                 Key: HDFS-9874
>                 URL: https://issues.apache.org/jira/browse/HDFS-9874
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Critical
>         Attachments: HDFS-9874-trunk.patch
>
>
> One of the failed volume shutdowns took 3 days to complete.
> Below are the relevant datanode logs while shutting down a volume (due to disk failure):
> {noformat}
> 2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
>         at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
>         at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
> 2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
> 2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
> java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
> java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
> 2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
> 2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
> 2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
> {noformat}
> The datanode waits for the reference count to become zero before shutting down the volume:
> {code:title=FsVolumeImpl.java|borderStyle=solid}
> while (this.reference.getReferenceCount() > 0) {
>   if (FsDatasetImpl.LOG.isDebugEnabled()) {
>     FsDatasetImpl.LOG.debug(String.format(
>         "The reference count for %s is %d, wait to be 0.",
>         this, reference.getReferenceCount()));
>   }
>   try {
>     Thread.sleep(SLEEP_MILLIS);
>   } catch (InterruptedException e) {
>     throw new IOException(e);
>   }
> }
> {code}
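>
> To make this concrete, here is a minimal, runnable sketch of the pattern (illustrative only, not HDFS code; the names VolumeHandle, obtainReference, and StuckXceiverSketch are hypothetical). Shutdown spins exactly like the loop above, so a single DataXceiver-like thread that never releases its reference blocks it indefinitely; the interrupt at the end plays the role that block recovery eventually played here.
> {code:title=StuckXceiverSketch.java|borderStyle=solid}
> import java.io.Closeable;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Hypothetical stand-in for a volume that hands out counted references.
> class VolumeHandle {
>   private final AtomicInteger refCount = new AtomicInteger();
>
>   Closeable obtainReference() {
>     refCount.incrementAndGet();
>     return refCount::decrementAndGet; // close() releases the reference
>   }
>
>   // Mirrors the FsVolumeImpl wait loop quoted above.
>   void waitForZeroReferences() throws InterruptedException {
>     while (refCount.get() > 0) {
>       System.out.printf("reference count is %d, waiting for 0%n", refCount.get());
>       Thread.sleep(500);
>     }
>     System.out.println("all references released; volume can shut down");
>   }
> }
>
> public class StuckXceiverSketch {
>   public static void main(String[] args) throws Exception {
>     VolumeHandle volume = new VolumeHandle();
>
>     // A DataXceiver-like thread: grabs a reference, then blocks on a
>     // client that never sends another byte (the multi-day writer here).
>     Thread xceiver = new Thread(() -> {
>       try (Closeable ref = volume.obtainReference()) {
>         Thread.sleep(Long.MAX_VALUE); // stand-in for a blocking socket read
>       } catch (Exception e) {
>         // Interrupted: the thread dies and try-with-resources releases
>         // the reference, as block recovery eventually forced here.
>       }
>     });
>     xceiver.start();
>     Thread.sleep(100); // let the xceiver obtain its reference
>
>     // Stand-in for commitBlockSynchronization killing the stuck write;
>     // without this, waitForZeroReferences() below spins forever.
>     new Thread(() -> {
>       try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
>       xceiver.interrupt();
>     }).start();
>
>     volume.waitForZeroReferences(); // returns only after the xceiver dies
>   }
> }
> {code}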
> Just before the datanode logged the following line,
> {noformat}
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
> {noformat}
> I saw the following stack trace:
> {noformat}
> 2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
> 2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
> 2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation src: /upStreamDNIp:55710 dst: /thisDNIp:1004
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Tracking the block blk_7059462432_1109832821906, I see that it was created on 2016-02-17 15:06:28,256:
> {noformat}
> 2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
> {noformat}
> The job that created this file had been running for more than 7 days, and the client eventually failed to renew its delegation token, so the lease manager could not renew the lease for this file.
> Once commitBlockSynchronization kicked in, it tried to recover the block, and eventually the DataXceiver thread died, decrementing the reference count and finally letting the volume shutdown proceed.
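>
> As a cross-check on the "3 days" figure: the first and last [Thread-49277] lines quoted above bracket the stalled checkDirs run, and the gap between those two timestamps works out to roughly 3 days and 6 hours. A quick sketch of the arithmetic (only the timestamps are taken from the logs; the class name StallDuration is made up):
> {code:title=StallDuration.java|borderStyle=solid}
> import java.time.Duration;
> import java.time.LocalDateTime;
> import java.time.format.DateTimeFormatter;
>
> public class StallDuration {
>   public static void main(String[] args) {
>     DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
>     // First and last [Thread-49277] log lines quoted above.
>     LocalDateTime removalStarted  = LocalDateTime.parse("2016-02-21 10:12:55,333", f);
>     LocalDateTime removalFinished = LocalDateTime.parse("2016-02-24 16:05:53,286", f);
>     // Prints PT77H52M57.953S, i.e. about 3 days 6 hours spent spinning
>     // in the reference-count wait loop.
>     System.out.println(Duration.between(removalStarted, removalFinished));
>   }
> }
> {code}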