Yongjun Zhang created HDFS-10624:
------------------------------------

             Summary: VolumeScanner to report why a block is found bad
                 Key: HDFS-10624
                 URL: https://issues.apache.org/jira/browse/HDFS-10624
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode, hdfs
            Reporter: Yongjun Zhang


Seeing the following in the DN log:

{code}
2016-04-07 20:27:45,416 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96465013 
received exception java.io.EOFException: Premature EOF: no length prefix 
available
2016-04-07 20:27:45,416 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
rn2-lampp-lapp1115.rno.apple.com:1110:DataXceiver error processing WRITE_BLOCK 
operation  src: /10.204.64.137:45112 dst: /10.204.64.151:1110
java.io.EOFException: Premature EOF: no length prefix available
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:738)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-04-07 20:27:46,116 WARN 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96458336 on 
/ngs8/app/lampp/dfs/dn
2016-04-07 20:27:46,117 ERROR 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner: 
VolumeScanner(/ngs8/app/lampp/dfs/dn, DS-a14baf2b-a1ef-4282-8d88-3203438e708e) 
exiting because of exception
java.lang.NullPointerException
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.reportBadBlocks(DataNode.java:1018)
        at 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner$ScanResultHandler.handle(VolumeScanner.java:287)
        at 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:443)
        at 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:547)
        at 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:621)
2016-04-07 20:27:46,118 INFO 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner: 
VolumeScanner(/ngs8/app/lampp/dfs/dn, DS-a14baf2b-a1ef-4282-8d88-3203438e708e) 
exiting.
2016-04-07 20:27:46,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(10.204.64.151, 
datanodeUuid=6064994a-6769-4192-9377-83f78bd3d7a6, infoPort=0, 
infoSecurePort=1175, ipcPort=1120, 
storageInfo=lv=-56;cid=cluster6;nsid=1112595121;c=0):Failed to transfer 
BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96465013 to 
10.204.64.10:1110 got
java.net.SocketException: Original Exception : java.io.IOException: Connection 
reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at 
org.apache.hadoop.security.SaslOutputStream.write(SaslOutputStream.java:190)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:585)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:758)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:705)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2154)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2884)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:862)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:200)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:118)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
        ... 25 more
{code}

In particular, the entry

{code}
2016-04-07 20:27:46,116 WARN 
org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96458336 on 
/ngs8/app/lampp/dfs/dn
{code}
means the VolumeScanner/BlockScanner found the replica corrupt, or found some other problem with it. It would be very helpful to report the reason here: if the replica is corrupt, the offset of the first corrupt data chunk in the block, and the total replica length. Creating this jira to request this enhancement.
 
BTW, the NPE in the above log was resolved as HDFS-10512 (thanks Wei-Chiu).
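As a rough illustration of the requested report, a scanner could re-verify the stored per-chunk checksums and log the offset of the first mismatching chunk together with the replica length. The sketch below is hypothetical (the class and method names are not from the HDFS codebase); it uses plain CRC32 over 512-byte chunks just to show the shape of the computation, not the actual checksum type or chunk size a given cluster is configured with.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: find the byte offset of the first corrupt chunk in a
// replica by re-verifying per-chunk checksums, so the scanner can report
// "corrupt at offset X of Y bytes" instead of just "Reporting bad <block>".
public class FirstCorruptChunkFinder {
    static final int CHUNK_SIZE = 512; // bytes covered by one checksum (illustrative)

    /**
     * Returns the byte offset where the first corrupt chunk begins,
     * or -1 if every chunk's CRC matches its stored value.
     */
    static long firstCorruptOffset(byte[] replica, long[] storedCrcs) {
        for (int chunk = 0; chunk * CHUNK_SIZE < replica.length; chunk++) {
            int start = chunk * CHUNK_SIZE;
            int len = Math.min(CHUNK_SIZE, replica.length - start);
            CRC32 crc = new CRC32();
            crc.update(replica, start, len);
            if (crc.getValue() != storedCrcs[chunk]) {
                return start; // first corrupt byte range begins here
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1536]; // three zero-filled chunks
        long[] crcs = new long[3];
        for (int c = 0; c < 3; c++) {
            CRC32 crc = new CRC32();
            crc.update(data, c * CHUNK_SIZE, CHUNK_SIZE);
            crcs[c] = crc.getValue(); // record the "stored" checksums
        }
        data[600] ^= 0xFF; // simulate corruption inside the second chunk
        System.out.println("first corrupt offset = "
                + firstCorruptOffset(data, crcs)
                + ", replica length = " + data.length);
    }
}
```

With this information in hand, the WARN line could carry the offset and replica length alongside the block ID, which would make triaging reports like the one above much faster.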




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
