Rushabh S Shah created HDFS-9874:
------------------------------------

             Summary: Long living DataXceiver threads cause volume shutdown to block.
                 Key: HDFS-9874
                 URL: https://issues.apache.org/jira/browse/HDFS-9874
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.7.0
            Reporter: Rushabh S Shah
            Assignee: Rushabh S Shah
            Priority: Critical


One failed-volume shutdown took 3 days to complete.
Below are the relevant datanode logs from shutting down a volume (due to a disk failure):

{noformat}
2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current: 
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
        at java.lang.Thread.run(Thread.java:745)
2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)

2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
        at java.lang.Thread.run(Thread.java:745)

2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
{noformat}

The datanode waits for the reference count to become zero before shutting down 
the volume.
{code:title=FsVolumeImpl.java|borderStyle=solid}
while (this.reference.getReferenceCount() > 0) {
  if (FsDatasetImpl.LOG.isDebugEnabled()) {
    FsDatasetImpl.LOG.debug(String.format(
        "The reference count for %s is %d, wait to be 0.",
        this, reference.getReferenceCount()));
  }
  try {
    Thread.sleep(SLEEP_MILLIS);
  } catch (InterruptedException e) {
    throw new IOException(e);
  }
}
{code}
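
For context, this loop only exits once every holder of a volume reference has released it. Below is a minimal toy model of that reference-counting pattern (ToyVolume and obtainReference() are illustrative names, not the actual HDFS classes): each in-flight operation obtains a reference and releases it on close, and shutdown spins until the count reaches zero, so a single DataXceiver that never releases its reference keeps shutdown blocked indefinitely.
{code:title=ToyVolume.java|borderStyle=solid}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the volume reference-counting pattern. ToyVolume and
// obtainReference() are illustrative names, not the real HDFS API.
class ToyVolume {
  private static final long SLEEP_MILLIS = 1000;
  private final AtomicInteger refCount = new AtomicInteger(0);

  // Each in-flight operation acquires a reference for its duration
  // and releases it via close(), ideally in try-with-resources.
  Closeable obtainReference() {
    refCount.incrementAndGet();
    return () -> refCount.decrementAndGet();
  }

  // Mirrors the busy-wait in FsVolumeImpl above: a reference that is
  // held for days keeps this loop spinning for days.
  void shutdown() throws IOException {
    while (refCount.get() > 0) {
      try {
        Thread.sleep(SLEEP_MILLIS);
      } catch (InterruptedException e) {
        throw new IOException(e);
      }
    }
  }
}
{code}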

Just before the datanode logged the following line,
{noformat}
2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
{noformat}
I saw the following stack trace:
{noformat}
2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:745)
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation  src: /upStreamDNIp:55710 dst: /thisDNIp:1004
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:745)

{noformat}


Tracking block blk_7059462432_1109832821906, I see that it was created on 2016-02-17 15:06:28,256:
{noformat}
2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
{noformat}


The job that created this file had been running for more than 7 days. The client eventually failed to renew its delegation token, so the lease manager could not renew the lease for this file.
Once commitBlockSynchronization kicked in, it tried to recover the block, and the DataXceiver thread eventually died, decrementing the reference count.
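
A minimal sketch of that failure mode (hypothetical names, not the real DataXceiver/BlockReceiver code): the writer thread takes a volume reference when the block write starts and only releases it when the thread exits, so a client that stalls for days pins the reference count above zero until block recovery kills the thread.
{code:title=StalledWriterDemo.java|borderStyle=solid}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo, not HDFS code: models a DataXceiver that holds a
// volume reference for the entire block write. The reference is only
// released when the thread dies, e.g. after block recovery interrupts it.
public class StalledWriterDemo {
  public static void main(String[] args) throws Exception {
    AtomicInteger refCount = new AtomicInteger(0);

    Thread xceiver = new Thread(() -> {
      refCount.incrementAndGet();        // reference taken at write start
      try {
        Thread.sleep(Long.MAX_VALUE);    // client never sends the last packet
      } catch (InterruptedException e) {
        // stands in for block recovery tearing down the write
      } finally {
        refCount.decrementAndGet();      // released only as the thread exits
      }
    }, "DataXceiver-demo");
    xceiver.start();

    Thread.sleep(100);
    System.out.println("refCount while the write is stalled: " + refCount.get()); // 1

    xceiver.interrupt();                 // block recovery kicks in
    xceiver.join();
    System.out.println("refCount after the thread dies: " + refCount.get());      // 0
  }
}
{code}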


