YCozy created HDFS-15414:
----------------------------

             Summary: java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
                 Key: HDFS-15414
                 URL: https://issues.apache.org/jira/browse/HDFS-15414
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.10.0
            Reporter: YCozy
We observed this exception in a DataNode's log while no nodes in the cluster were being shut down. Specifically, we have a cluster with 3 DataNodes (DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, this exception occurs in DN3's log:

{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007) Starting thread to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:9666, datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664, infoSecurePort=0, ipcPort=9667, storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
        ... 11 more
{noformat}

Port 9766 is DN2's address. Around the same time, we observe the following exceptions in DN2's log:

{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src: /127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47618 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in state FINALIZED and thus cannot be created.
{noformat}

However, this ReplicaAlreadyExistsException does not look like the cause of the broken pipe: DN2's log contains an earlier occurrence of the same exception, yet DN3's log contains only one occurrence of the broken pipe.
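For reference, the failure mode suggested by the two logs is that the receiving DataNode's DataXceiver rejects the WRITE_BLOCK and closes its socket while the sending side is still streaming the block, so the sender's write fails with EPIPE/ECONNRESET. A minimal, generic sketch of that mechanism (plain Python sockets, not HDFS code; all names here are illustrative):

```python
# Generic reproduction of the "broken pipe" failure mode (illustrative, not HDFS code):
# the receiver reads a little, rejects the request, and closes its socket abruptly,
# while the sender keeps streaming data.
import socket
import threading

def receiver(server: socket.socket) -> None:
    conn, _ = server.accept()
    conn.recv(64)   # read only a "header", then reject the transfer
    conn.close()    # abrupt close, like a DataXceiver aborting on an error

server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=receiver, args=(server,), daemon=True).start()

sender = socket.create_connection(("127.0.0.1", port))
outcome = "no error"
try:
    # Keep writing after the peer has closed. Once the kernel learns the
    # connection is gone, the blocked/next write fails.
    for _ in range(1000):
        sender.sendall(b"x" * 65536)
except (BrokenPipeError, ConnectionResetError) as e:
    outcome = type(e).__name__
finally:
    sender.close()
    server.close()

print(outcome)
```

When the peer closes with unread data still in its receive buffer, the sender's write fails with EPIPE or ECONNRESET, which is the same condition the JDK surfaces as {{java.io.IOException: Broken pipe}}.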
Here's the earlier occurrence of ReplicaAlreadyExistsException on DN2:

{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src: /127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47462 dst: /127.0.0.1:9766; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
{noformat}

So we suspect there is a bug causing the broken pipe.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)