[
https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Cao updated HADOOP-16677:
--------------------------------
Summary: Recalculate the remaining timeout millis correctly while throwing
an InterruptedException in SocketIOWithTimeout. (was: The remaining timeout
millis for constructing InterruptedException also need to be recalculated
correctly when thread is interrupted while blocking in select().)
> Recalculate the remaining timeout millis correctly while throwing an
> InterruptedException in SocketIOWithTimeout.
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-16677
> URL: https://issues.apache.org/jira/browse/HADOOP-16677
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Affects Versions: 3.1.3
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Minor
>
> In SocketIOWithTimeout, when a thread is interrupted and exits from select(),
> it proceeds to throw an InterruptedIOException. The exception message should
> report the correctly recalculated remaining timeout millis rather than the
> total timeout millis; otherwise it can be very misleading.
> For example, if an HDFS writer has not sent any packet to the pipeline for
> more than 60s (e.g. because of a full GC or network issues), one of the
> pipeline DataNodes may time out and close its sockets to the other DNs. Its
> upstream DN's PacketResponder then immediately hits an EOF and interrupts its
> own DataXceiver, and that DataXceiver finally prints a log like:
>
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 remote=/10.215.153.105:38559]. 60000 millis timeout left.
>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>     ...
>
> {code}
>
> This log is very misleading, because 60000 millis of timeout left implies that
> the DataXceiver never blocked in select(), which is unrealistic. In fact,
> the true timeout millis left should be: 60000 - timeElapsedWhenSelect.
> A proper log should look like this:
>
>
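The recalculation described above (total timeout minus the time already spent
in select()) can be sketched as follows. This is only an illustration under
stated assumptions, not the actual patch: the class RemainingTimeoutSketch and
the methods remainingMillis and interrupted are hypothetical names, the
timestamps are passed in explicitly instead of being read from a clock, and
clamping the result at zero is a choice made here rather than something the
issue specifies. The message wording follows the log quoted above.

```java
import java.io.InterruptedIOException;

// Hypothetical sketch; none of these names come from Hadoop itself.
class RemainingTimeoutSketch {

    // Millis still left of the total timeout after waiting from startMs to nowMs.
    // Clamped at zero (an assumption) so the message never shows a negative value.
    static long remainingMillis(long totalTimeoutMs, long startMs, long nowMs) {
        long timeElapsedWhenSelect = nowMs - startMs;
        return Math.max(0L, totalTimeoutMs - timeElapsedWhenSelect);
    }

    // Builds the exception with the remaining timeout instead of the total one.
    static InterruptedIOException interrupted(String channel, long totalTimeoutMs,
                                              long startMs, long nowMs) {
        long left = remainingMillis(totalTimeoutMs, startMs, nowMs);
        return new InterruptedIOException(
            "Interrupted while waiting for IO on channel " + channel
                + ". " + left + " millis timeout left.");
    }

    public static void main(String[] args) {
        // Illustrative numbers: a 60000 ms timeout, interrupted 45000 ms into the wait.
        InterruptedIOException e = interrupted("<channel>", 60_000L, 0L, 45_000L);
        System.out.println(e.getMessage());
    }
}
```

With these illustrative inputs the message ends in "15000 millis timeout left."
rather than repeating the full 60000, which is the behaviour the issue asks for.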
--
This message was sent by Atlassian Jira
(v8.3.4#803005)