[ 
https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Cao updated HADOOP-16677:
--------------------------------
    Labels: pull-request-available  (was: )

> Recalculate the remaining timeout millis correctly while throwing an 
> InterupptedException in SocketIOWithTimeout.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16677
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16677
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>              Labels: pull-request-available
>
> In SocketIOWithTimeout, when a thread was interrupted and exit from select(), 
> it proceed to throw an InterruptedIOException, in exception message the 
> remaining timeout mills should be recalcuated correctly rather than simply 
> give a total timeout millis ,  otherwise it could be very misleading.
> For example, if an hdfs writer has not sent any packet to the pipeline more 
> than 60s (e.g. full gc or network issues),  then one of pipeline datanodes 
> may be timeout and close its sockets to other dns,  so its upstream DN's 
> PacketResponder will immediately meet an EOF and then interrupt its own 
> DataXceiver, finally its DataXeiver will print some logs like:
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24
>  09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
> for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 
> remote=/10.215.153.105:38559]. 60000 millis timeout left.
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>  ...
>  
> {code}
> However, This log is wrong and very misleading, a 60000 mills timeout left 
> implies that the DataXceiver has never blocked in select(),  but this is not 
> the actual situation,  In fact, the truly timeout mills left is 60000 - 
> timeElapsedWhenSelect, rather than simply 60000.
> Finally, a properly log should be like this, the key improvment is "*Total 
> timeout mills is 60000, 1000 millis timeout left*":  
> {code:java}
> 2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 
> 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
> for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/192.168.202.12:50010 
> remote=/192.168.202.11:57006]. Total timeout mills is 60000, 1000 millis 
> timeout left.
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>  ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to