[ 
https://issues.apache.org/jira/browse/HADOOP-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Cao updated HADOOP-16677:
--------------------------------
    Description: 
In SocketIOWithTimeout, when a thread was interrupted and exit from select(), 
it proceed to throw an InterruptedIOException, in exception message the 
remaining timeout mills should be recalcuated correctly rather than simply give 
a total timeout millis ,  otherwise it could be very misleading.

For example, if an hdfs writer has not sent any packet to the pipeline more 
than 60s (e.g. full gc or network issues),  then one of pipeline datanodes may 
be timeout and close its sockets to other dns,  so its upstream DN's 
PacketResponder will immediately meet an EOF and then interrupt its own 
DataXceiver, finally its DataXeiver will print some logs like:
{code:java}
2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for 
BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 
09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 
remote=/10.215.153.105:38559]. 60000 millis timeout left.
 at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 ...
 
{code}
 

This log is very misleading because a 60000 mills timeout left implies that the 
DataXceiver never blocks in select(), and this is unrealistic.  in fact, the 
truly timeout mills left should be: 60000 - timeElapsedWhenSelect.

 

Finally, a properly log should be like this, the key improvment is "*Total 
timeout mills is 60000, 0 millis timeout left*":  
{code:java}
2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for 
BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 
16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/192.168.202.12:50010 
remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0 millis timeout 
left.
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
...
{code}

  was:
In SocketIOWithTimeout, when a thread was interrupted and exit from select(), 
it proceed to throw an InterruptedIOException, in exception message the 
remaining timeout mills should be recalcuated correctly rather than simply give 
a total timeout millis ,  otherwise it could be very misleading.

For example, if an hdfs writer has not sent any packet to the pipeline more 
than 60s (e.g. full gc or network issues),  then one of pipeline datanodes may 
be timeout and close its sockets to other dns,  so its upstream DN's 
PacketResponder will immediately meet an EOF and then interrupt its own 
DataXceiver, finally its DataXeiver will print some logs like:
{code:java}
2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for 
BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24 
09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 
remote=/10.215.153.105:38559]. 60000 millis timeout left.
 at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 ...
 
{code}
This log is very misleading because a 60000 mills timeout left implies that the 
DataXceiver never blocks in select(), and this is unrealistic.  in fact, the 
truly timeout mills left should be: 60000 - timeElapsedWhenSelect.

Finally, a properly log should be like this, the key improvment is "*Total 
timeout mills is 60000, 0 millis timeout left*":  
{code:java}
2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for 
BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 
16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
java.nio.channels.SocketChannel[connected local=/192.168.202.12:50010 
remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0 millis timeout 
left.
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
...
{code}


> Recalculate the remaining timeout millis correctly while throwing an 
> InterupptedException in SocketIOWithTimeout.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16677
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16677
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Minor
>
> In SocketIOWithTimeout, when a thread was interrupted and exit from select(), 
> it proceed to throw an InterruptedIOException, in exception message the 
> remaining timeout mills should be recalcuated correctly rather than simply 
> give a total timeout millis ,  otherwise it could be very misleading.
> For example, if an hdfs writer has not sent any packet to the pipeline more 
> than 60s (e.g. full gc or network issues),  then one of pipeline datanodes 
> may be timeout and close its sockets to other dns,  so its upstream DN's 
> PacketResponder will immediately meet an EOF and then interrupt its own 
> DataXceiver, finally its DataXeiver will print some logs like:
> {code:java}
> 2019-10-24 09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-753871533-10.215.131.216-1511957392115:blk_10646544613_95736750382019-10-24
>  09:22:58,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
> for BP-753871533-10.215.131.216-1511957392115:blk_10646544613_9573675038
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/10.196.146.114:9003 
> remote=/10.215.153.105:38559]. 60000 millis timeout left.
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
>  at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>  ...
>  
> {code}
>  
> This log is very misleading because a 60000 mills timeout left implies that 
> the DataXceiver never blocks in select(), and this is unrealistic.  in fact, 
> the truly timeout mills left should be: 60000 - timeElapsedWhenSelect.
>  
> Finally, a properly log should be like this, the key improvment is "*Total 
> timeout mills is 60000, 0 millis timeout left*":  
> {code:java}
> 2019-10-31 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_10162019-10-31 
> 16:20:39,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception 
> for BP-1895911599-192.168.202.11-1572488735889:blk_1073741833_1016
> java.io.InterruptedIOException: Interrupted while waiting for IO on channel 
> java.nio.channels.SocketChannel[connected local=/192.168.202.12:50010 
> remote=/192.168.202.11:57006]. Total timeout mills is 60000, 0 millis timeout 
> left.
> at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to