Yongjun Zhang created HDFS-10624:
------------------------------------

             Summary: VolumeScanner to report why a block is found bad
                 Key: HDFS-10624
                 URL: https://issues.apache.org/jira/browse/HDFS-10624
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode, hdfs
            Reporter: Yongjun Zhang
Seeing the following in the DN log:

{code}
2016-04-07 20:27:45,416 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96465013 received exception java.io.EOFException: Premature EOF: no length prefix available
2016-04-07 20:27:45,416 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: rn2-lampp-lapp1115.rno.apple.com:1110:DataXceiver error processing WRITE_BLOCK operation src: /10.204.64.137:45112 dst: /10.204.64.151:1110
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:738)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-04-07 20:27:46,116 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96458336 on /ngs8/app/lampp/dfs/dn
2016-04-07 20:27:46,117 ERROR org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/ngs8/app/lampp/dfs/dn, DS-a14baf2b-a1ef-4282-8d88-3203438e708e) exiting because of exception
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.datanode.DataNode.reportBadBlocks(DataNode.java:1018)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner$ScanResultHandler.handle(VolumeScanner.java:287)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:443)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:547)
        at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:621)
2016-04-07 20:27:46,118 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/ngs8/app/lampp/dfs/dn, DS-a14baf2b-a1ef-4282-8d88-3203438e708e) exiting.
2016-04-07 20:27:46,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.204.64.151, datanodeUuid=6064994a-6769-4192-9377-83f78bd3d7a6, infoPort=0, infoSecurePort=1175, ipcPort=1120, storageInfo=lv=-56;cid=cluster6;nsid=1112595121;c=0):Failed to transfer BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96465013 to 10.204.64.10:1110 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.apache.hadoop.security.SaslOutputStream.write(SaslOutputStream.java:190)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:585)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:758)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:705)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2154)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2884)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:862)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:200)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:118)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
        ... 25 more
{code}

In particular, the line

{code}
2016-04-07 20:27:46,116 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad BP-1800173197-10.204.68.5-1444425156296:blk_1170125248_96458336 on /ngs8/app/lampp/dfs/dn
{code}

means the VolumeScanner/BlockScanner found the replica corrupt, or found some other issue with it. It would be very helpful to report the reason here: if the replica is corrupt, the offset of the first corrupt chunk in the block, plus the total replica length. Creating this jira to request this enhancement.

BTW, the NPE in the above log was resolved by HDFS-10512 (thanks Wei-Chiu).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
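One possible shape for the requested report, as a minimal self-contained sketch: a scan result that carries a reason, the offset of the first corrupt chunk, and the replica length, and folds them into the WARN message. All names here (ScanResult, ScanOutcome, describe) are hypothetical illustrations, not the actual VolumeScanner/ScanResultHandler API.

```java
// Hypothetical sketch of the extra detail VolumeScanner's bad-block WARN could carry.
// None of these names are from the real HDFS code base.
public class ScanResult {
    enum ScanOutcome { CHECKSUM_MISMATCH, TRUNCATED_REPLICA, IO_ERROR }

    private final ScanOutcome outcome;
    private final long firstCorruptOffset; // byte offset of first bad chunk, -1 if unknown
    private final long replicaLength;      // on-disk length of the replica

    ScanResult(ScanOutcome outcome, long firstCorruptOffset, long replicaLength) {
        this.outcome = outcome;
        this.firstCorruptOffset = firstCorruptOffset;
        this.replicaLength = replicaLength;
    }

    /** Detail string that could be appended to the "Reporting bad ..." log line. */
    String describe() {
        return "reason=" + outcome
            + ", firstCorruptOffset=" + firstCorruptOffset
            + ", replicaLength=" + replicaLength;
    }

    public static void main(String[] args) {
        ScanResult r = new ScanResult(ScanOutcome.CHECKSUM_MISMATCH, 4096L, 134217728L);
        // Example of the enriched WARN line this jira asks for:
        System.out.println("Reporting bad blk_1170125248_96458336 on /ngs8/app/lampp/dfs/dn: "
            + r.describe());
    }
}
```

With something like this, the operator could tell at a glance whether the block was reported for a checksum mismatch, truncation, or a plain I/O error, and where in the replica the corruption starts.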