SunQiang created HBASE-28752:
--------------------------------

             Summary: wal.AsyncFSWAL: sync failed
                 Key: HBASE-28752
                 URL: https://issues.apache.org/jira/browse/HBASE-28752
             Project: HBase
          Issue Type: Improvement
          Components: asyncclient, wal
    Affects Versions: 2.2.5, 2.1.10
            Reporter: SunQiang


Our HBase system is used for OLAP. The client has strict latency and stability 
requirements, and the client configuration is as follows:
{code:java}
hbase.rpc.timeout: 100
hbase.client.operation.timeout: 500
hbase.client.retries.number: 3
hbase.client.pause: 120 {code}
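For context, these settings are equivalent to configuring the client programmatically (a sketch using the standard HBase client API; all values are in milliseconds except the retry count):
{code:java}
// Sketch: the same client settings applied in code instead of the client's hbase-site.xml.
Configuration conf = HBaseConfiguration.create();
conf.setInt("hbase.rpc.timeout", 100);                // per-RPC timeout, ms
conf.setInt("hbase.client.operation.timeout", 500);   // whole-operation timeout, ms
conf.setInt("hbase.client.retries.number", 3);        // retry attempts
conf.setInt("hbase.client.pause", 120);               // pause between retries, ms
try (Connection connection = ConnectionFactory.createConnection(conf)) {
  // issue reads with these latency bounds ...
} {code}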
When I took a DataNode offline, I received this exception:
{code:java}
2024-06-03 17:19:16,535 WARN  [RpcServer.default.RWQ.Fifo.read.handler=216,queue=4,port=16020] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.111.242.219:50010]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3436)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
    at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1173)
    at org.apache.hadoop.hdfs.DFSInputStream.access$200(DFSInputStream.java:92)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1118)
    at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1110)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2022)
    at org.apache.hadoop.hdfs.DFSClient$2.rejectedExecution(DFSClient.java:3481)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
    at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1297)
 {code}
This makes the HBase service unstable: HBase keeps trying to access the offline 
DataNode, and establishing a socket connection to it takes a long time. From the 
stack trace I found that this connect timeout is controlled by the 
dfs.client.socket-timeout configuration.
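For reference, a minimal sketch of the corresponding override in hbase-site.xml (5000 ms is the 5s value used in the experiment below; dfs.client.socket-timeout is the HDFS client key DFS_CLIENT_SOCKET_TIMEOUT_KEY, whose default is 60s):
{code:xml}
<!-- Sketch: lower the HDFS client socket timeout from the 60s default to 5s. -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>5000</value>
</property> {code}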

----

In hbase-site.xml, I found that adjusting the 
{color:#FF0000}dfs.client.socket-timeout{color} configuration is effective, so I 
lowered it from 60s to 5s. But when I lowered dfs.client.socket-timeout further, 
to {color:#FF0000}200ms{color}, the following exception occurred:
{code:java}
2024-06-18 15:51:24,212 WARN  [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Timeout(200ms) waiting for response.{code}
 

The dfs.client.socket-timeout configuration is reused by HBase's 
FanOutOneBlockAsyncDFSOutput class.

----

In the 'FanOutOneBlockAsyncDFSOutput' constructor:
{code:java}
FanOutOneBlockAsyncDFSOutput(Configuration conf, FSUtils fsUtils, DistributedFileSystem dfs,
    DFSClient client, ClientProtocol namenode, String clientName, String src, long fileId,
    LocatedBlock locatedBlock, Encryptor encryptor, List<Channel> datanodeList,
    DataChecksum summer, ByteBufAllocator alloc) {
  this.conf = conf;
  this.fsUtils = fsUtils;
  this.dfs = dfs;
  this.client = client;
  this.namenode = namenode;
  this.fileId = fileId;
  this.clientName = clientName;
  this.src = src;
  this.block = locatedBlock.getBlock();
  this.locations = locatedBlock.getLocations();
  this.encryptor = encryptor;
  this.datanodeList = datanodeList;
  this.summer = summer;
  this.maxDataLen = MAX_DATA_LEN - (MAX_DATA_LEN % summer.getBytesPerChecksum());
  this.alloc = alloc;
  this.buf = alloc.directBuffer(sendBufSizePRedictor.initialSize());
  this.state = State.STREAMING;
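  // The timeout passed to setupReceiver is read from dfs.client.socket-timeout
  // (DFS_CLIENT_SOCKET_TIMEOUT_KEY), falling back to READ_TIMEOUT when it is not set.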
  setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
} {code}
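To make the failure mode concrete, here is an illustrative sketch (not the actual HBase implementation) of what setupReceiver plausibly does with this value: it arms a per-datanode read timeout, and when no response arrives in time the output is failed, which AsyncFSWAL then logs as "sync failed". It assumes the fields shown in the constructor above, Netty's IdleStateHandler, and a hypothetical failWaitingSyncs helper:
{code:java}
// Illustrative sketch only, based on the constructor above and the observed error message;
// the real FanOutOneBlockAsyncDFSOutput may differ in detail.
private void setupReceiver(final int timeoutMs) {
  for (Channel ch : datanodeList) {
    ch.pipeline().addLast(
      // Fire a READER_IDLE event if nothing is read from this datanode within timeoutMs.
      new IdleStateHandler(timeoutMs, 0, 0, TimeUnit.MILLISECONDS),
      new ChannelInboundHandlerAdapter() {
        @Override
        public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
          if (evt instanceof IdleStateEvent) {
            // failWaitingSyncs is a hypothetical helper: fail all pending syncs, which
            // AsyncFSWAL reports as "sync failed".
            failWaitingSyncs(new IOException("Timeout(" + timeoutMs + "ms) waiting for response"));
          } else {
            ctx.fireUserEventTriggered(evt);
          }
        }
      });
  }
} {code}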
----
My implementation process:
1. Add a new configuration to hbase-site.xml:
{code:java}
+ <property>
+ <name>hbase.wal.asyncfsoutput.timeout</name>
+ <value>60000</value>
+ </property> {code}
2. Modify the code (a possible refinement is sketched after this list):
{code:java}
+ private static final String FANOUT_TIMEOUTKEY = "hbase.wal.asyncfsoutput.timeout";
- setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
+ setupReceiver(conf.getInt(FANOUT_TIMEOUTKEY, READ_TIMEOUT));
 {code}
3. Repackage.
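A possible refinement of step 2 (my suggestion, not existing HBase code): let the new key fall back to dfs.client.socket-timeout, so deployments that do not set hbase.wal.asyncfsoutput.timeout keep today's behaviour:
{code:java}
// Sketch under the assumption that backward compatibility is wanted; FANOUT_TIMEOUTKEY is
// the constant proposed above, the other names already exist in FanOutOneBlockAsyncDFSOutput.
private static final String FANOUT_TIMEOUTKEY = "hbase.wal.asyncfsoutput.timeout";

// In the constructor:
setupReceiver(conf.getInt(FANOUT_TIMEOUTKEY,
  conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT))); {code}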
 
----
I would like to ask the community whether it would be possible to introduce a 
separate configuration item for FanOutOneBlockAsyncDFSOutput.java, so that WAL 
write/sync timeout failures are isolated from a small dfs.client.socket-timeout 
value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
