liuyanyu created HDFS-15407:
-------------------------------
Summary: Hedged read will not work if a datanode is slow for a long
time
Key: HDFS-15407
URL: https://issues.apache.org/jira/browse/HDFS-15407
Project: Hadoop HDFS
Issue Type: Bug
Components: 3.1.1, datanode
Affects Versions: 3.1.1
Reporter: liuyanyu
Assignee: liuyanyu
I used cgroups to limit the DataNode's IO to 1024 bytes/s and then read a file
with hedged reads enabled (dfs.client.hedged.read.threadpool.size set to 5,
dfs.client.hedged.read.threshold.millis set to 500). The first 5 buffer reads
timed out, and the client switched to other DataNodes and read successfully.
After that, the read was stuck for a long time because of a
SocketTimeoutException. The log is as follows:
2020-06-11 16:40:07,832 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:08,562 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:09,102 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:09,642 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:10,182 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
2020-06-11 16:40:10,182 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
2020-06-11 16:40:10,219 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | I/O error constructing remote block reader. | BlockReaderFactory.java:764
java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:551)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:418)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:661)
    at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1063)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1035)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1031)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | Connection failure: Failed to connect to /xx.xx.xx.28:25009 for file /testhdfs/test2.jar for block BP-1820384660-xx.xx.xx.74-1585533043013:blk_1082582662_8861386:java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009] | DFSInputStream.java:1118
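
One possible reading of the log above: once all 5 hedged read threads are
occupied by reads against the throttled DataNode, further reads are rejected
by the pool and run in the calling thread ("Execution rejected, Executing in
current thread"), so the caller itself can stall. The toy model below
illustrates only that mechanism with plain java.util.concurrent. It uses no
Hadoop classes, the class and method names are hypothetical, and the
assumption that the client pool behaves like a bounded pool with a
caller-runs rejection policy is inferred from the log messages, not confirmed
here.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: no Hadoop classes are used; all names are hypothetical. */
final class HedgedReadSketch {

  /**
   * Submits `reads` slow reads to a bounded pool (poolSize must be >= 1) and
   * returns how many of them ran in the submitting (caller) thread.
   */
  static int readsRunInCaller(int poolSize, int reads, long slowReadMillis)
      throws InterruptedException {
    final Thread caller = Thread.currentThread();
    final AtomicInteger inCaller = new AtomicInteger();

    // Assumption: a bounded pool with no queueing, where a rejected task
    // is executed by the thread that tried to submit it.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        1, poolSize, 60, TimeUnit.SECONDS,
        new SynchronousQueue<Runnable>(),
        new ThreadPoolExecutor.CallerRunsPolicy());

    Runnable slowRead = () -> {
      if (Thread.currentThread() == caller) {
        inCaller.incrementAndGet();
      }
      try {
        // Stands in for a read that is stuck on a throttled node.
        Thread.sleep(slowReadMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    for (int i = 0; i < reads; i++) {
      pool.execute(slowRead); // blocks the caller when the pool is saturated
    }
    pool.shutdown();
    pool.awaitTermination(10 * slowReadMillis + 1000, TimeUnit.MILLISECONDS);
    return inCaller.get();
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    int n = readsRunInCaller(5, 6, 300);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("reads run in caller thread: " + n
        + ", elapsed ms: " + elapsedMs);
  }
}
```

In this model the sixth read occupies the caller for the full duration of the
slow read. If the same thing happens in the client with a 600000 ms socket
timeout, that would be consistent with the roughly ten-minute gap between the
16:40:10 and 16:50:07 log entries.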