SunQiang created HBASE-28752:
--------------------------------

             Summary: wal.AsyncFSWAL: sync failed
                 Key: HBASE-28752
                 URL: https://issues.apache.org/jira/browse/HBASE-28752
             Project: HBase
          Issue Type: Improvement
          Components: asyncclient, wal
    Affects Versions: 2.2.5, 2.1.10
            Reporter: SunQiang
Our HBase cluster serves OLAP workloads, and the client has strict latency and stability requirements. The client configuration is as follows:
{code:java}
hbase.rpc.timeout: 100
hbase.client.operation.timeout: 500
hbase.client.retries.number: 3
hbase.client.pause: 120
{code}
When I took a DataNode offline, I received this exception:
{code:java}
2024-06-03 17:19:16,535 WARN [RpcServer.default.RWQ.Fifo.read.handler=216,queue=4,port=16020] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.111.242.219:50010]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3436)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
    at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1173)
    at org.apache.hadoop.hdfs.DFSInputStream.access$200(DFSInputStream.java:92)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1118)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1110)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2022)
    at org.apache.hadoop.hdfs.DFSClient$2.rejectedExecution(DFSClient.java:3481)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
    at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1297)
{code}
This makes the HBase service unstable: HBase still tries to read from the offline DataNode, and it takes a long time for the socket connection to that DataNode to time out. From the stack trace I found that this timeout is controlled by the {color:#FF0000}dfs.client.socket-timeout{color} configuration.

--

Lowering {color:#FF0000}dfs.client.socket-timeout{color} in hbase-site.xml is effective, so I reduced it from 60s to 5s. But when I lowered it further, to {color:#FF0000}200ms{color}, the following exception occurred:
{code:java}
2024-06-18 15:51:24,212 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Timeout(200ms) waiting for response
{code}
The reason is that dfs.client.socket-timeout is also reused in HBase's FanOutOneBlockAsyncDFSOutput class, which AsyncFSWAL uses to write the WAL.
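To make the tuning above concrete, this is a sketch of the hbase-site.xml override I used (the value is in milliseconds; 5000 corresponds to the 5s setting that worked, while 200 triggered the AsyncFSWAL sync failure shown above):
{code:xml}
<!-- Sketch only: dfs.client.socket-timeout is in milliseconds.
     5000 (5s) avoided the long connect hang to the offline DataNode;
     lowering it further to 200 caused the wal.AsyncFSWAL "sync failed" timeout. -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>5000</value>
</property>
{code}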
--

In the FanOutOneBlockAsyncDFSOutput constructor, the same dfs.client.socket-timeout value is passed to setupReceiver():
{code:java}
FanOutOneBlockAsyncDFSOutput(Configuration conf, FSUtils fsUtils, DistributedFileSystem dfs,
    DFSClient client, ClientProtocol namenode, String clientName, String src, long fileId,
    LocatedBlock locatedBlock, Encryptor encryptor, List<Channel> datanodeList,
    DataChecksum summer, ByteBufAllocator alloc) {
  this.conf = conf;
  this.fsUtils = fsUtils;
  this.dfs = dfs;
  this.client = client;
  this.namenode = namenode;
  this.fileId = fileId;
  this.clientName = clientName;
  this.src = src;
  this.block = locatedBlock.getBlock();
  this.locations = locatedBlock.getLocations();
  this.encryptor = encryptor;
  this.datanodeList = datanodeList;
  this.summer = summer;
  this.maxDataLen = MAX_DATA_LEN - (MAX_DATA_LEN % summer.getBytesPerChecksum());
  this.alloc = alloc;
  this.buf = alloc.directBuffer(sendBufSizePRedictor.initialSize());
  this.state = State.STREAMING;
  setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
}
{code}
----
My implementation process:

1. Add a new configuration item in hbase-site.xml:
{code:xml}
+ <property>
+   <name>hbase.wal.asyncfsoutput.timeout</name>
+   <value>60000</value>
+ </property>
{code}
2. Modify FanOutOneBlockAsyncDFSOutput.java:
{code:java}
151 + private static final String FANOUT_TIMEOUTKEY = "hbase.wal.asyncfsoutput.timeout";

339 - setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
339 + setupReceiver(conf.getInt(FANOUT_TIMEOUTKEY, READ_TIMEOUT));
{code}
3. Repackage and redeploy.

---

I would like to ask the community whether it makes sense to add a separate configuration item in FanOutOneBlockAsyncDFSOutput.java, so that the WAL receiver timeout is isolated from dfs.client.socket-timeout and lowering dfs.client.socket-timeout no longer causes WAL sync failures (a possible shape is sketched below).
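For discussion, here is a minimal sketch of one possible shape of the change. It uses the proposed key name hbase.wal.asyncfsoutput.timeout from above (not an existing HBase configuration), and the fallback to dfs.client.socket-timeout is my own assumption rather than part of the patch above:
{code:java}
// Sketch only, not the actual HBase implementation.
// FANOUT_TIMEOUT_KEY is the proposed (hypothetical) WAL-specific setting; if it is not set,
// fall back to dfs.client.socket-timeout and then to the hard-coded READ_TIMEOUT default,
// so existing deployments keep their current behavior.
private static final String FANOUT_TIMEOUT_KEY = "hbase.wal.asyncfsoutput.timeout";

// In the constructor, replace the single DFS_CLIENT_SOCKET_TIMEOUT_KEY lookup with:
int receiverTimeoutMs = conf.getInt(FANOUT_TIMEOUT_KEY,
  conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
setupReceiver(receiverTimeoutMs);
{code}
With a fallback like this, clusters that only tune dfs.client.socket-timeout would behave exactly as today, while clusters that need a very small HDFS read timeout could keep a larger timeout for the WAL output via the new key.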