liuyanyu created HDFS-15407:
-------------------------------

             Summary: Hedged read will not work if a datanode slow for a long time
                 Key: HDFS-15407
                 URL: https://issues.apache.org/jira/browse/HDFS-15407
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: 3.1.1, datanode
    Affects Versions: 3.1.1
            Reporter: liuyanyu
            Assignee: liuyanyu

I used cgroups to limit the datanode's IO to 1024 bytes/s and read a file with hedged read enabled (dfs.client.hedged.read.threadpool.size is set to 5, dfs.client.hedged.read.threshold.millis is set to 500). The first 5 buffer reads time out and are switched to other datanodes, where they succeed. After that the read is stuck for a long time because of a SocketTimeoutException (about 10 minutes in the log below, which matches the 600000 ms timeout). The log is as follows:

2020-06-11 16:40:07,832 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:08,562 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:09,102 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:09,642 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:10,182 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:10,182 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
2020-06-11 16:40:10,219 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | I/O error constructing remote block reader. | BlockReaderFactory.java:764
java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:551)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:418)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:661)
        at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1063)
        at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1035)
        at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1031)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | Connection failure: Failed to connect to /xx.xx.xx.28:25009 for file /testhdfs/test2.jar for block BP-1820384660-xx.xx.xx.74-1585533043013:blk_1082582662_8861386:java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009] | DFSInputStream.java:1118
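
A possible explanation for the "Execution rejected, Executing in current thread" lines, sketched as a stand-alone simulation. This is not HDFS code: it only uses java.util.concurrent, and it assumes (I have not confirmed this against DFSClient here) that the hedged read pool is a bounded ThreadPoolExecutor with no queue whose rejected tasks run in the submitting thread. The class name HedgedReadStallSketch, the method names, and the latch standing in for the throttled datanode are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class HedgedReadStallSketch {

    // Assumption: bounded pool, direct hand-off (no queue), and rejected
    // tasks are executed by the thread that submitted them.
    static ThreadPoolExecutor newPool(int maxThreads) {
        return new ThreadPoolExecutor(1, maxThreads, 60, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>(),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /**
     * Fills the pool with reads that never return (the throttled datanode),
     * then submits one more read and reports whether it ran in the caller.
     */
    static boolean extraReadRunsInCaller(int maxThreads) {
        ThreadPoolExecutor pool = newPool(maxThreads);
        CountDownLatch slowDatanode = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(maxThreads);
        try {
            for (int i = 0; i < maxThreads; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        slowDatanode.await(); // stuck until the datanode "answers"
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            try {
                started.await(); // every pool thread is now occupied
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            AtomicReference<Thread> ranIn = new AtomicReference<>();
            // The pool is full, so this task is rejected and, under the
            // assumed policy, executed right here by the caller.
            pool.execute(() -> ranIn.set(Thread.currentThread()));
            return ranIn.get() == Thread.currentThread();
        } finally {
            slowDatanode.countDown(); // release the stuck reads so the demo exits
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("extra read ran in caller thread: "
                + extraReadRunsInCaller(5));
    }
}
```

If the real pool behaves like this, then once all 5 hedged read threads are blocked on the slow datanode, the next read runs in the main thread, and a read against the slow datanode there would hold the caller until the socket timeout expires.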