[ 
https://issues.apache.org/jira/browse/RATIS-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898452#comment-17898452
 ] 

Ivan Andika edited comment on RATIS-2189 at 11/15/24 1:21 AM:
--------------------------------------------------------------

[~szetszwo] Thanks for checking this out. 

> Which version of Ozone/Ratis are you testing? We saw a similar problem 
> previously. Not sure if it is the same.

We are using Ozone 1.4.1 and Ratis 3.1.1. May I know which problem you are 
referring to?

Currently, we are exploring using Ozone as the S3 remote storage for the Kafka 
tiered storage feature. However, during a stress test we found that when Kafka 
increased its write and read workload to the S3G, the datanodes threw 
"java.lang.OutOfMemoryError: Direct buffer memory", causing writes to get stuck 
with the following stack trace. I assume this means that it was interfering 
with Hadoop RPC, which also uses direct buffer memory.
{code:java}
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:695)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
        at sun.nio.ch.IOUtil.write(IOUtil.java:58)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:62)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:158)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:116)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.hadoop.ipc.Client$IpcStreams.sendRequest(Client.java:1930)
        at org.apache.hadoop.ipc.Client$Connection$RpcRequestSender.run(Client.java:1113)
        at java.lang.Thread.run(Thread.java:748){code}
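The trace above fails in java.nio.Bits.reserveMemory, so one way to check whether the JVM-accounted direct buffer pool is what ran out is to sample the platform BufferPoolMXBean named "direct". The following is only a minimal sketch, not code from Ozone or Ratis: the class and method names are hypothetical, and buffers that a library allocates outside ByteBuffer.allocateDirect may not be counted by this bean.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper (not part of Ozone or Ratis) that reads the JVM's
// "direct" buffer pool statistics through the standard management API.
class DirectBufferStats {
  /** Returns the bytes used by the "direct" buffer pool, or -1 if the pool is not found. */
  static long directMemoryUsed() {
    List<BufferPoolMXBean> pools =
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    for (BufferPoolMXBean pool : pools) {
      if ("direct".equals(pool.getName())) {
        return pool.getMemoryUsed();
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    long before = directMemoryUsed();
    // Allocate 1 MiB of direct memory so the change is visible in the pool.
    ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
    long after = directMemoryUsed();
    System.out.println("direct pool before=" + before + " after=" + after
        + " capacity=" + buffer.capacity());
  }
}
```

Logging this value periodically on a datanode during the stress test would show whether usage climbs toward the -XX:MaxDirectMemorySize limit before the error appears.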
We also encountered some TimeoutIOExceptions on both the client and the DN side. 
{code:java}
2024-11-13 11:33:43,537 [NettyClientStreamRpc-workerGroup--thread121] ERROR org.apache.ratis.client.impl.OrderedStreamAsync: Failed to send request, header=DataStreamRequestHeader:clientId=client-53089C94F05D,type=STREAM_DATA,id=139369,offset=65011712,length=1048576
java.util.concurrent.CompletionException: org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: Failed to send DataStreamRequestByteBuffer:clientId=client-53089C94F05D,type=STREAM_DATA,id=139369,offset=65011712,length=1048576 via channel [id: 0x5ac464c1, L:/10.80.133.23:42750 - R:10.80.135.22/10.80.135.22:9855]
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.ratis.netty.client.NettyClientStreamRpc.lambda$null$1(NettyClientStreamRpc.java:470)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: Failed to send DataStreamRequestByteBuffer:clientId=client-53089C94F05D,type=STREAM_DATA,id=139369,offset=65011712,length=1048576 via channel [id: 0x5ac464c1, L:/10.80.133.23:42750 - R:10.80.135.22/10.80.135.22:9855]
        ... 10 more
2024-11-13 11:33:43,537 [NettyClientStreamRpc-workerGroup--thread121] ERROR org.apache.ratis.client.impl.OrderedStreamAsync: Failed to send request, header=DataStreamRequestHeader:clientId=client-53089C94F05D,type=STREAM_DATA,id=139369,offset=66060288,length=1048576
java.util.concurrent.CompletionException: java.lang.IllegalStateException: Request{streamOffset=66060288, type=STREAM_DATA}, : Request{streamOffset=65011712, type=STREAM_DATA} failed
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.ratis.netty.client.NettyClientReplies$ReplyEntry.completeExceptionally(NettyClientReplies.java:167)
        at org.apache.ratis.netty.client.NettyClientReplies$ReplyMap.completeExceptionally(NettyClientReplies.java:86)
        at org.apache.ratis.netty.client.NettyClientReplies$ReplyMap.failAll(NettyClientReplies.java:92)
        at org.apache.ratis.netty.client.NettyClientReplies$ReplyMap.fail(NettyClientReplies.java:97)
        at org.apache.ratis.netty.client.NettyClientStreamRpc.lambda$null$1(NettyClientStreamRpc.java:472)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
        at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Request{streamOffset=66060288, type=STREAM_DATA}, : Request{streamOffset=65011712, type=STREAM_DATA} failed
        ... 12 more
2024-11-13 11:33:43,537 [qtp582300198-205] WARN org.apache.hadoop.hdds.scm.storage.BlockDataStreamOutput: Failed to write all chunks through stream: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s: Failed to send DataStreamRequestByteBuffer:clientId=client-53089C94F05D,type=STREAM_DATA,id=139369,offset=65011712,length=1048576 via channel [id: 0x5ac464c1, L:/10.80.133.23:42750 - R:10.80.135.22/10.80.135.22:9855] {code}
May I know whether this is expected when the datanodes are fully utilized due 
to high read and write traffic? 

Is there any kind of backpressure mechanism to ensure that traffic from the 
clients (S3Gs) does not overwhelm the datanodes? I saw from 
[https://github.com/netty/netty/pull/6662] that we might be able to use 
channelWritabilityChanged to throttle the writes.
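
As a rough illustration of the backpressure idea, the sketch below caps the number of in-flight asynchronous requests with a semaphore. It is an application-level stand-in for throttling on channel writability, not an existing Ratis or Ozone mechanism, and InFlightLimiter and its methods are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical limiter: at most maxInFlight requests may be outstanding.
class InFlightLimiter {
  private final Semaphore permits;

  InFlightLimiter(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  /**
   * Waits up to the timeout for a free slot, then starts the request.
   * The slot is released when the request's future completes, normally or not.
   */
  <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> send,
                                  long timeout, TimeUnit unit) {
    boolean acquired;
    try {
      acquired = permits.tryAcquire(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      acquired = false;
    }
    if (!acquired) {
      // Reject instead of queueing more data behind a slow receiver.
      CompletableFuture<T> rejected = new CompletableFuture<>();
      rejected.completeExceptionally(
          new IllegalStateException("too many in-flight requests"));
      return rejected;
    }
    final CompletableFuture<T> reply;
    try {
      reply = send.get();
    } catch (RuntimeException e) {
      permits.release(); // the request never started
      throw e;
    }
    return reply.whenComplete((result, error) -> permits.release());
  }

  int availablePermits() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    InFlightLimiter limiter = new InFlightLimiter(2);
    CompletableFuture<String> reply = limiter.submit(
        () -> CompletableFuture.completedFuture("ack"), 1, TimeUnit.SECONDS);
    System.out.println(reply.join() + ", free slots: " + limiter.availablePermits());
  }
}
```

With a cap like this, a slow datanode makes the sender wait or fail fast rather than accumulate buffered data without bound.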


> Use ByteBufAllocator#ioBuffer in NettyDataStreamUtils
> -----------------------------------------------------
>
>                 Key: RATIS-2189
>                 URL: https://issues.apache.org/jira/browse/RATIS-2189
>             Project: Ratis
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Ivan Andika
>            Priority: Minor
>
> Currently, NettyDataStreamUtils uses ByteBufAllocator#directBuffer, which 
> forces every ByteBufAllocator to allocate a direct buffer, even 
> PreferHeapByteBufAllocator (e.g. when we set 
> -Dorg.apache.ratis.thirdparty.io.netty.noPreferDirect=true).
> It's better to use ioBuffer and delegate to the actual ByteBufAllocator the 
> choice of which type of memory to use.
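
The difference described above can be modelled with plain java.nio buffers. This is a simplified model rather than Netty's classes: AllocatorModel and its nested types are hypothetical and only mirror the idea that directBuffer always returns direct memory, while ioBuffer lets the allocator decide.

```java
import java.nio.ByteBuffer;

// Hypothetical model of an allocator with explicit and delegated methods.
class AllocatorModel {
  interface Allocator {
    ByteBuffer heapBuffer(int capacity);
    ByteBuffer directBuffer(int capacity);
    /** Buffer intended for I/O; the implementation chooses the memory type. */
    ByteBuffer ioBuffer(int capacity);
  }

  abstract static class Base implements Allocator {
    @Override
    public ByteBuffer heapBuffer(int capacity) {
      return ByteBuffer.allocate(capacity);
    }

    @Override
    public ByteBuffer directBuffer(int capacity) {
      return ByteBuffer.allocateDirect(capacity);
    }
  }

  /** Chooses direct memory for I/O buffers. */
  static final class PreferDirect extends Base {
    @Override
    public ByteBuffer ioBuffer(int capacity) {
      return directBuffer(capacity);
    }
  }

  /** Chooses heap memory for I/O buffers. */
  static final class PreferHeap extends Base {
    @Override
    public ByteBuffer ioBuffer(int capacity) {
      return heapBuffer(capacity);
    }
  }

  public static void main(String[] args) {
    Allocator preferHeap = new PreferHeap();
    // Calling directBuffer bypasses the allocator's preference.
    System.out.println("directBuffer direct? " + preferHeap.directBuffer(64).isDirect());
    // Calling ioBuffer respects it.
    System.out.println("ioBuffer direct? " + preferHeap.ioBuffer(64).isDirect());
  }
}
```

In this model, a caller that hard-codes directBuffer gets direct memory even from the heap-preferring allocator, which is the behaviour the issue proposes to avoid.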



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
