[
https://issues.apache.org/jira/browse/HDFS-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
YCozy updated HDFS-15414:
-------------------------
Description:
We observed this exception in a DataNode's log even though we were not shutting
down any nodes in the cluster. Specifically, we have a cluster with 3 DataNodes
(DN1, DN2, DN3) and 2 NameNodes (NN1, NN2). At some point, the following
exception appears in DN3's log:
{noformat}
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
Starting thread to transfer
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007)
Starting thread to transfer
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to 127.0.0.1:9766
2020-06-08 21:53:03,381 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for /app/dn3/current
2020-06-08 21:53:03,383 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(127.0.0.1:9666,
datanodeUuid=4408ff04-e406-4ccc-bd5c-8516ad57ec21, infoPort=9664,
infoSecurePort=0, ipcPort=9667,
storageInfo=lv=-57;cid=CID-c816c4ea-a559-4fd5-9b3a-b5994dc3a5fa;nsid=34747155;c=1591653120007):Failed
to transfer BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 to
127.0.0.1:9766 got
java.net.SocketException: Original Exception : java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
    at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:620)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:804)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:751)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2469)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
    ... 11 more{noformat}
127.0.0.1:9766 is DN2's address, so the failed transfer above was going to DN2.
Around the same time, we see the following exception in DN2's log:
{noformat}
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 src:
/127.0.0.1:47618 dest: /127.0.0.1:9766
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 received
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists
in state FINALIZED and thus cannot be created.
2020-06-08 21:53:03,379 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:47618 dst: /127.0.0.1:9766;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1002 already exists in
state FINALIZED and thus cannot be created.{noformat}
However, this exception does not look like the cause of the broken pipe: DN2's
log contains an earlier ReplicaAlreadyExistsException as well, yet DN3's log
contains only this single broken-pipe error. Here is the earlier occurrence of
the ReplicaAlreadyExistsException on DN2:
{noformat}
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Receiving BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 src:
/127.0.0.1:47462 dest: /127.0.0.1:9766
2020-06-08 21:52:54,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
opWriteBlock BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 received
exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException:
Block BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists
in state FINALIZED and thus cannot be created.
2020-06-08 21:52:54,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
007e9b383989:9766:DataXceiver error processing WRITE_BLOCK operation src:
/127.0.0.1:47462 dst: /127.0.0.1:9766;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-553302063-172.17.0.3-1591653120007:blk_1073741825_1001 already exists in
state FINALIZED and thus cannot be created.{noformat}
So we suspect there is a bug that is causing the broken pipe.
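For context, a broken pipe on the sending side generally means the peer closed
the connection while the sender was still writing, so the question is what made
the socket to DN2 go away mid-transfer. The following minimal, stand-alone Java
sketch (not DataNode code; the class name, packet size, and socket setup are
ours purely for illustration) shows the mechanism: if the receiving end closes
the connection early, the sender's subsequent writes fail with a
broken-pipe/connection-reset style SocketException, which is how the failure
surfaces in DN3's transferTo() stack above.
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Minimal stand-alone sketch (not DataNode code): the "receiver" plays the
 * role of DN2 and closes the connection right after accepting it; the
 * "sender" plays the role of DN3's block transfer and keeps writing packets
 * until the write fails.
 */
public class BrokenPipeSketch {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(0); // ephemeral port, stand-in for DN2
    Thread receiver = new Thread(() -> {
      try (Socket s = server.accept()) {
        // Read nothing and close immediately, simulating a receiver that
        // aborts the transfer early for whatever reason.
      } catch (IOException ignored) {
      }
    });
    receiver.start();

    try (Socket sender = new Socket("127.0.0.1", server.getLocalPort())) {
      OutputStream out = sender.getOutputStream();
      byte[] packet = new byte[64 * 1024]; // arbitrary packet size for the demo
      for (int i = 0; i < 1000; i++) {
        out.write(packet); // fails once the peer's close/reset takes effect
        out.flush();
      }
    } catch (IOException e) {
      // Typically "Broken pipe" or "Connection reset", matching the
      // SocketException DN3 reports from transferTo().
      System.out.println("Sender failed: " + e);
    } finally {
      receiver.join();
      server.close();
    }
  }
}
{code}
Depending on timing and kernel buffering, the sketch reports either "Broken
pipe" or "Connection reset"; the point is only that an early close on the
receiving side is what produces this kind of error on the sender.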
> java.net.SocketException: Original Exception : java.io.IOException: Broken
> pipe
> -------------------------------------------------------------------------------
>
> Key: HDFS-15414
> URL: https://issues.apache.org/jira/browse/HDFS-15414
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.10.0
> Reporter: YCozy
> Priority: Major
>