Rushabh S Shah created HDFS-9874:
------------------------------------
Summary: Long living DataXceiver threads cause volume shutdown to block.
Key: HDFS-9874
URL: https://issues.apache.org/jira/browse/HDFS-9874
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.7.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
Priority: Critical
One failed-volume shutdown took 3 days to complete.
Below are the relevant datanode logs from the shutdown of a volume (due to a disk failure):
{noformat}
2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
        at java.lang.Thread.run(Thread.java:745)
2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
        at java.lang.Thread.run(Thread.java:745)
2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
{noformat}
The datanode waits for the reference count to become zero before shutting down
the volume.
{code:title=FsVolumeImpl.java|borderStyle=solid}
while (this.reference.getReferenceCount() > 0) {
  if (FsDatasetImpl.LOG.isDebugEnabled()) {
    FsDatasetImpl.LOG.debug(String.format(
        "The reference count for %s is %d, wait to be 0.",
        this, reference.getReferenceCount()));
  }
  try {
    Thread.sleep(SLEEP_MILLIS);
  } catch (InterruptedException e) {
    throw new IOException(e);
  }
}
{code}
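For context, every writer (DataXceiver) obtains a reference on the volume before touching it and releases it only when the thread exits, so the loop above cannot finish while any xceiver is still alive. Below is a minimal, self-contained sketch of that contract; the class and method names are illustrative, not the actual FsVolumeImpl/FsVolumeReference code.
{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model only; the real logic lives in FsVolumeImpl and
// FsVolumeReference.
class VolumeRefSketch {
  private static final long SLEEP_MILLIS = 1000;
  private final AtomicInteger refCount = new AtomicInteger(0);

  // Each DataXceiver obtains a reference before writing to the volume
  // and releases it via close() when the thread finishes.
  Closeable obtainReference() {
    refCount.incrementAndGet();
    return refCount::decrementAndGet;
  }

  // Volume shutdown spins until every reference is released. A DataXceiver
  // stuck for days on a dead pipeline keeps the count above zero, so this
  // loop (and the disk-checker thread driving it) hangs with it.
  void shutdown() throws IOException {
    while (refCount.get() > 0) {
      try {
        Thread.sleep(SLEEP_MILLIS);
      } catch (InterruptedException e) {
        throw new IOException(e);
      }
    }
  }
}
{code}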
Just before the datanode logged the following line,
{noformat}
2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
{noformat}
I saw the following stack trace:
{noformat}
2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:745)
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation src: /upStreamDNIp:55710 dst: /thisDNIp:1004
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
Tracking block blk_7059462432_1109832821906, I see that it was created at 2016-02-17 15:06:28:
{noformat}
2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
{noformat}
The job that created this file had been running for more than 7 days, and the client eventually failed to renew its delegation token, so the lease manager could not renew the lease for this file.
Once commitBlockSynchronization kicked in, it recovered the block; the long-living DataXceiver thread finally died, decrementing the reference count and allowing the volume shutdown to complete.
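To make the failure mode concrete, here is a hedged sketch (continuing the illustrative VolumeRefSketch above; receivePacketsUntilEof is a hypothetical stand-in, not a real HDFS method) of how a DataXceiver pins the reference until block recovery kills it:
{code:java}
// Illustrative only: the reference is released only when the xceiver
// thread dies, which is what finally let the volume shutdown proceed.
class XceiverSketch implements Runnable {
  private final VolumeRefSketch volume;

  XceiverSketch(VolumeRefSketch volume) {
    this.volume = volume;
  }

  @Override
  public void run() {
    try (java.io.Closeable ref = volume.obtainReference()) {
      // Blocks as long as the client keeps the pipeline open; with a
      // stuck client this can be days.
      receivePacketsUntilEof();
    } catch (Exception e) {
      // "Premature EOF from inputStream": block recovery closed the socket.
    }
    // try-with-resources has released the reference by this point, so the
    // shutdown loop waiting on the count can now exit.
  }

  private void receivePacketsUntilEof() throws java.io.IOException {
    // Stand-in for BlockReceiver.receiveBlock(...), which reads packets
    // from the upstream socket until EOF or an exception.
  }
}
{code}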