[
https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Cao updated HADOOP-16677:
--------------------------------
Status: Open (was: Patch Available)
> Recalculate the remaining timeout millis correctly while throwing an
> InterupptedException in SocketIOWithTimeout.
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-16677
> URL: https://issues.apache.org/jira/browse/HADOOP-16677
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Affects Versions: 3.1.3
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Minor
>
> In SocketIOWithTimeout, when a thread was interrupted and exit from select(),
> it proceed to throw an InterruptedIOException, in exception message the
> remaining timeout mills should be recalcuated correctly rather than simply
> give a total timeout millis , otherwise it could be very misleading.
> For example, if an hdfs writer has not sent any packet to the pipeline more
> than 60s (e.g. full gc or network issues), then one of pipeline datanodes
> may be timeout and close its sockets to other dns, so its upstream DN's
> PacketResponder will immediately meet an EOF and then interrupt its own
> DataXceiver, finally its DataXeiver will print some logs like:
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception for
> BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24
> 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
> for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel
> java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003
> remote=/10.215.153.105:38559]. 60000 millis timeout left.
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> ...
>
> {code}
> However, This log is wrong and very misleading, a 60000 mills timeout left
> implies that the DataXceiver has never blocked in select(), but this is not
> the actual situation, In fact, the truly timeout mills left is 60000 -
> timeElapsedWhenSelect, rather than simply 60000.
> Finally, a properly log should be like this, the key improvment is "*Total
> timeout mills is 60000, 1000 millis timeout left*":
> {code:java}
> 2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception for
> BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31
> 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
> for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel
> java.nio.channels.SocketChannel[connected local=/192.168.202.12:50010
> remote=/192.168.202.11:57006]. Total timeout mills is 60000, 1000 millis
> timeout left.
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> ...
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]