[ https://issues.apache.org/jira/browse/HDFS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205092#comment-15205092 ]

Rushabh S Shah commented on HDFS-9874:
--------------------------------------

The NPE is expected.
{quote}
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1714)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.shutdownBlockPool(FsDatasetImpl.java:2591)
at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownBlockPool(DataNode.java:1479)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.shutdownActor(BPOfferService.java:411)
{quote}
This is getting called while shutting down the cluster. It is expected because the test triggers only part of the work the checkDiskError thread normally does.

{code:title=DataNode.java|borderStyle=solid}
private void checkDiskError() {
  Set<File> unhealthyDataDirs = data.checkDataDir();
  if (unhealthyDataDirs != null && !unhealthyDataDirs.isEmpty()) {
    try {
      // Remove all unhealthy volumes from DataNode.
      removeVolumes(unhealthyDataDirs, false);
    } catch (IOException e) {
      LOG.warn("Error occurred when removing unhealthy storage dirs: "
          + e.getMessage(), e);
    }
    StringBuilder sb = new StringBuilder("DataNode failed volumes:");
    for (File dataDir : unhealthyDataDirs) {
      sb.append(dataDir.getAbsolutePath() + ";");
    }
    handleDiskError(sb.toString());
  }
}
{code}

In the test case I call only the first line of the above method, because I don't want the test to wait for DataNode#checkDiskErrorInterval (5 seconds by default).
As a result, removeVolumes(unhealthyDataDirs, false) is never executed, hence the NPE.
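
For reference, the trigger in the test looks roughly like this (a minimal sketch of the idea, not the actual patch; cluster is an assumed MiniDFSCluster handle and DataNodeTestUtils#getFSDataset is assumed as the way to reach the dataset):
{code:title=Test sketch|borderStyle=solid}
// Illustrative sketch: invoke the volume check directly instead of waiting
// up to DataNode#checkDiskErrorInterval (5 seconds by default) for the
// background checkDiskError thread to fire.
DataNode dn = cluster.getDataNodes().get(0);
FsDatasetSpi<?> dataset = DataNodeTestUtils.getFSDataset(dn);

// Only the first line of DataNode#checkDiskError, nothing more:
Set<File> unhealthyDataDirs = dataset.checkDataDir();

// removeVolumes(unhealthyDataDirs, false) is deliberately not called,
// so the later getBlockReports() during cluster shutdown hits the
// expected NPE shown above.
{code}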

I am not able to reproduce the test failure on my local machine on JDK 7 or JDK 8.
[~jojochuang]: does it fail on your machine?

> Long living DataXceiver threads cause volume shutdown to block.
> ---------------------------------------------------------------
>
>                 Key: HDFS-9874
>                 URL: https://issues.apache.org/jira/browse/HDFS-9874
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Critical
>             Fix For: 2.7.3
>
>         Attachments: HDFS-9874-trunk-1.patch, HDFS-9874-trunk-2.patch, HDFS-9874-trunk.patch
>
>
> One of the failed volume shutdowns took 3 days to complete.
> Below are the relevant datanode logs from shutting down a volume (due to a disk failure):
> {noformat}
> 2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing 
> failed volume volumeA/current: 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
> writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
>         at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
>         at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing 
> scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
> 2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO 
> datanode.VolumeScanner: VolumeScanner(volumeA, 
> DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
> 2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN 
> datanode.VolumeScanner: VolumeScanner(volumeA, 
> DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
> java.io.FileNotFoundException: 
> volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp 
> (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> delete old dfsUsed file in 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> write dfsUsed to 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
> java.io.FileNotFoundException: 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only 
> file system)
>               at java.io.FileOutputStream.open(Native Method)
>               at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>               at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
>               at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>               at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>               at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>               at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>               at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed 
> volume: volumeA/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed 
> checkDirs. Found 1 failure volumes.
> 2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating 
> volumes (clear failure=false): volumeA
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing 
> volumeA from FsDataset.
> 2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block 
> level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
> 2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: 
> DataNode.handleDiskError: Keep Running: true
> {noformat} 
> The datanode waits for the reference count to become zero before shutting 
> down the volume.
> {code:title=FsVolumeImpl.java|borderStyle=solid}
> while (this.reference.getReferenceCount() > 0) {
>   if (FsDatasetImpl.LOG.isDebugEnabled()) {
>     FsDatasetImpl.LOG.debug(String.format(
>         "The reference count for %s is %d, wait to be 0.",
>         this, reference.getReferenceCount()));
>   }
>   try {
>     Thread.sleep(SLEEP_MILLIS);
>   } catch (InterruptedException e) {
>     throw new IOException(e);
>   }
> }
> {code}
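> Every active reader or writer pins the volume through a closeable reference, so a DataXceiver thread that never finishes keeps the count above zero indefinitely. Roughly (an illustrative sketch assuming the FsVolumeReference/obtainReference API, not code from the patch):
> {code}
> // Illustrative sketch (assumed API): a DataXceiver holds a volume
> // reference for the lifetime of its I/O; close() decrements the count
> // that the shutdown loop above is waiting on.
> try (FsVolumeReference ref = volume.obtainReference()) {
>   // ... stream block data against the volume ...
> } // released here; a thread stuck in the body never releases it
> {code}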
> Just before the datanode logged the following line,
> {noformat} 
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> delete old dfsUsed file in 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current
> {noformat}
> I saw the following stack trace:
> {noformat}
> 2016-02-24 16:05:53,017 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: Exception for 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,018 [PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: 
> PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
> 2016-02-24 16:05:53,018 [PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: 
> PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-02-24 16:05:53,018 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: opWriteBlock 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received 
> exception java.io.IOException: Premature EOF from inputStream
> 2016-02-24 16:05:53,018 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR 
> datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK 
> operation  src: /upStreamDNIp:55710 dst: /thisDNIp:1004
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Tracking the block blk_7059462432_1109832821906, I see that it was created on 2016-02-17 15:06:28,256:
> {noformat}
> 2016-02-17 15:06:28,928 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: Receiving 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: 
> /upStreamDNIp:55590 dest: /thisDNIp:1004
> {noformat}
> The job that created this file had been running for more than 7 days, and the client eventually failed to renew its delegation token, so the lease manager could not renew the lease for this file.
> Once commitBlockSynchronization kicked in, it tried to recover the block, and the DataXceiver thread eventually died, finally decrementing the reference count.


