[
https://issues.apache.org/jira/browse/HBASE-28752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868867#comment-17868867
]
SunQiang commented on HBASE-28752:
----------------------------------
[~zhangduo] If an FSHLog sync times out, the handler write thread will likely be
interrupted; in severe cases all handler write threads are exhausted and can no
longer accept write requests, so every incoming request ends up blocked and
waiting.
> wal.AsyncFSWAL: sync failed
> ---------------------------
>
> Key: HBASE-28752
> URL: https://issues.apache.org/jira/browse/HBASE-28752
> Project: HBase
> Issue Type: Improvement
> Components: asyncclient, wal
> Affects Versions: 2.1.10, 2.2.5
> Reporter: SunQiang
> Priority: Major
>
> Our HBase system is used for OLAP. The client has strict latency and stability
> requirements and is configured as follows:
> {code:java}
> hbase.rpc.timeout: 100
> hbase.client.operation.timeout: 500
> hbase.client.retries.number: 3
> hbase.client.pause: 120 {code}
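> For clarity, the same client-side tuning expressed programmatically. This is only an
> illustrative sketch, not part of the original setup; it assumes the standard HBase client
> Configuration/ConnectionFactory API and simply mirrors the hbase-site.xml values above.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
>
> public class LowLatencyClientSketch {
>   public static void main(String[] args) throws Exception {
>     // Illustrative only: same values as the hbase-site.xml snippet above.
>     Configuration conf = HBaseConfiguration.create();
>     conf.setInt("hbase.rpc.timeout", 100);              // per-RPC timeout, ms
>     conf.setInt("hbase.client.operation.timeout", 500); // end-to-end operation timeout, ms
>     conf.setInt("hbase.client.retries.number", 3);      // client retry attempts
>     conf.setInt("hbase.client.pause", 120);             // pause between retries, ms
>     try (Connection connection = ConnectionFactory.createConnection(conf)) {
>       // issue Gets/Puts here; they fail fast under these settings
>     }
>   }
> }
> {code}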
> When I took a DataNode offline, I received this exception:
> {code:java}
> 2024-06-03 17:19:16,535 WARN [RpcServer.default.RWQ.Fifo.read.handler=216,queue=4,port=16020] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.111.242.219:50010]
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>   at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3436)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
>   at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1173)
>   at org.apache.hadoop.hdfs.DFSInputStream.access$200(DFSInputStream.java:92)
>   at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1118)
>   at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1110)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2022)
>   at org.apache.hadoop.hdfs.DFSClient$2.rejectedExecution(DFSClient.java:3481)
>   at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>   at org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1297)
> {code}
> This makes the HBase service unstable: HBase keeps trying to reach the offline
> DataNode and spends a long time establishing a socket connection to it. From the
> stack traces I found that this connect time is controlled by the
> dfs.client.socket-timeout configuration.
> –
> Adjusting {color:#ff0000}dfs.client.socket-timeout{color} in hbase-site.xml is
> effective, so I lowered it from 60s to 5s. However, when I lowered it further to
> {color:#ff0000}200ms{color}, the following exception occurred:
> {code:java}
> 2024-06-18 15:51:24,212 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
> java.io.IOException: Timeout(200ms) waiting for response
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$userEventTriggered$4(FanOutOneBlockAsyncDFSOutput.java:302)
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:235)
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:98)
>   at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.userEventTriggered(FanOutOneBlockAsyncDFSOutput.java:301)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:326)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:312)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:304)
>   at org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:326)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:312)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:304)
>   at org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108)
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:366)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:326)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:312)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:304)
>   at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelIdle(IdleStateHandler.java:371)
>   at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:494)
>   at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:466)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>   at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:333)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
>   at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
>
> The dfs.client.socket-timeout configuration is reused in HBase's
> FanOutOneBlockAsyncDFSOutput class.
> –
> In the FanOutOneBlockAsyncDFSOutput constructor:
> {code:java}
> FanOutOneBlockAsyncDFSOutput(Configuration conf, FSUtils fsUtils, DistributedFileSystem dfs,
>     DFSClient client, ClientProtocol namenode, String clientName, String src, long fileId,
>     LocatedBlock locatedBlock, Encryptor encryptor, List<Channel> datanodeList,
>     DataChecksum summer, ByteBufAllocator alloc) {
>   this.conf = conf;
>   this.fsUtils = fsUtils;
>   this.dfs = dfs;
>   this.client = client;
>   this.namenode = namenode;
>   this.fileId = fileId;
>   this.clientName = clientName;
>   this.src = src;
>   this.block = locatedBlock.getBlock();
>   this.locations = locatedBlock.getLocations();
>   this.encryptor = encryptor;
>   this.datanodeList = datanodeList;
>   this.summer = summer;
>   this.maxDataLen = MAX_DATA_LEN - (MAX_DATA_LEN % summer.getBytesPerChecksum());
>   this.alloc = alloc;
>   this.buf = alloc.directBuffer(sendBufSizePRedictor.initialSize());
>   this.state = State.STREAMING;
>   // <-- the DFS socket timeout is reused as the ack timeout here
>   setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
> }
> {code}
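> The stack trace above ends in IdleStateHandler.channelIdle, which suggests the following
> pattern: setupReceiver installs a Netty read-idle timer with the resolved timeout, and a
> READER_IDLE event is converted into the "Timeout(...ms) waiting for response" failure that
> AsyncFSWAL then logs as "sync failed". The snippet below is only a simplified sketch of that
> pattern under those assumptions, not the actual HBase source:
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hbase.thirdparty.io.netty.channel.Channel;
> import org.apache.hbase.thirdparty.io.netty.channel.ChannelHandlerContext;
> import org.apache.hbase.thirdparty.io.netty.channel.SimpleChannelInboundHandler;
> import org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleState;
> import org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateEvent;
> import org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler;
>
> // Sketch only: illustrates how a read-idle timeout can surface as the sync failure.
> final class AckTimeoutSketch {
>   void setupReceiver(Channel datanodeChannel, int timeoutMs) {
>     // No ack from the datanode for longer than timeoutMs fires a READER_IDLE event.
>     datanodeChannel.pipeline().addLast(
>       new IdleStateHandler(timeoutMs, 0, 0, TimeUnit.MILLISECONDS),
>       new SimpleChannelInboundHandler<Object>() {
>         @Override
>         protected void channelRead0(ChannelHandlerContext ctx, Object msg) {
>           // handle the datanode ack here
>         }
>
>         @Override
>         public void userEventTriggered(ChannelHandlerContext ctx, Object evt) {
>           if (evt instanceof IdleStateEvent
>               && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
>             // This is the path that ends up logged as "wal.AsyncFSWAL: sync failed".
>             failWaiters(new IOException(
>               "Timeout(" + timeoutMs + "ms) waiting for response"));
>           }
>         }
>       });
>   }
>
>   private void failWaiters(IOException cause) {
>     // fail all pending sync() callers with the timeout exception
>   }
> }
> {code}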
> ----
> My implementation:
> 1. Add a new configuration item to hbase-site.xml:
> {code:xml}
> + <property>
> +   <name>hbase.wal.asyncfsoutput.timeout</name>
> +   <value>60000</value>
> + </property>
> {code}
> 2. Modify FanOutOneBlockAsyncDFSOutput.java:
> {code:java}
> 151 + private static final String FANOUT_TIMEOUTKEY = "hbase.wal.asyncfsoutput.timeout";
> 339 - setupReceiver(conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
> 339 + setupReceiver(conf.getInt(FANOUT_TIMEOUTKEY, READ_TIMEOUT));
> {code}
> 3. Repackage and deploy (a sketch of the resulting timeout lookup is shown after this list).
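> The proposed change boils down to resolving the ack timeout from a dedicated key instead of
> the HDFS socket timeout. A minimal sketch follows; the key name is taken from the description
> above, while the fallback chain and the constant values are my assumptions, one possible
> design rather than the final patch:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> final class AsyncFsOutputTimeoutSketch {
>   // Key name proposed in this issue.
>   private static final String FANOUT_TIMEOUTKEY = "hbase.wal.asyncfsoutput.timeout";
>   // Assumed to mirror HdfsClientConfigKeys.DFS_CLIENT_SOCKET_TIMEOUT_KEY.
>   private static final String DFS_CLIENT_SOCKET_TIMEOUT_KEY = "dfs.client.socket-timeout";
>   // Assumed default for this sketch, matching HDFS's 60s read timeout.
>   private static final int READ_TIMEOUT = 60 * 1000;
>
>   static int resolveTimeoutMs(Configuration conf) {
>     // Prefer the WAL-specific timeout; only fall back to the DFS socket timeout
>     // (and then the 60s default) when it is not set, so a small
>     // dfs.client.socket-timeout no longer fails WAL syncs.
>     return conf.getInt(FANOUT_TIMEOUTKEY,
>       conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, READ_TIMEOUT));
>   }
> }
> {code}
> With such a helper, the constructor call would become setupReceiver(resolveTimeoutMs(conf)),
> leaving dfs.client.socket-timeout free to be tuned for reads without affecting WAL syncs.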
>
> —
> I would like to ask the community whether it is possible to introduce a separate
> configuration item in FanOutOneBlockAsyncDFSOutput.java, so that WAL write timeout
> failures are isolated from a small dfs.client.socket-timeout value.