[ https://issues.apache.org/jira/browse/HDDS-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163506#comment-17163506 ]
Lokesh Jain commented on HDDS-3994:
-----------------------------------
[~maobaolong] the new retry policy was added to make client writes more stable. There were a couple of problems with the old retry policy:
# The client retried too frequently, at a very low interval (1 second). This could increase load on the pipeline and add memory and GC pressure.
# A client request could keep being retried for a very long time, i.e. a couple of hours.
The new policy addresses these concerns with a well-defined timeout for the Ratis client write operation on a pipeline. In addition, an exponential backoff policy was added for situations where the pipeline is running low on resources; it is used only in those low-resource situations.
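For readers unfamiliar with the difference, here is a minimal, self-contained sketch of how an exponential backoff schedule spaces out retries compared with the old fixed 1-second interval. This is only an illustration, not the actual Ratis/Ozone implementation; the base sleep, the cap and the attempt budget are assumed values.
{code:java}
import java.util.concurrent.TimeUnit;

public class BackoffSketch {
  public static void main(String[] args) {
    long baseSleepMs = TimeUnit.SECONDS.toMillis(1);  // assumed base sleep
    long maxSleepMs = TimeUnit.SECONDS.toMillis(30);  // assumed cap on the sleep
    int maxAttempts = 10;                             // assumed attempt budget

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // The sleep doubles on every attempt until it reaches the cap, so later
      // retries put far less pressure on a struggling pipeline than retrying
      // every second would.
      long sleepMs = Math.min(maxSleepMs, baseSleepMs << (attempt - 1));
      System.out.printf("attempt %d -> sleep %d ms%n", attempt, sleepMs);
    }
  }
}
{code}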
I think we should try these new policies and tune the default values of their configs. For the situation you have described, I think we only need to tune the config hdds.ratis.client.multilinear.random.retry.policy.
Possible defaults to try here are:
* 1s, 5, 10s, 5, 15s, 5, 20s, 5, 25s, 5, 60s, 10
* 1s, 10, 10s, 5, 15s, 5, 20s, 5, 25s, 5, 60s, 10
The other configs should not have any impact on the problem you are seeing.
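To make the tuning concrete, here is a hedged sketch of overriding that config programmatically with the first value suggested above; setting the same key in ozone-site.xml has the same effect. As I understand the format, the value is a list of "sleep duration, retry count" pairs, so this value allows 35 retries with sleeps adding up to roughly 955 seconds. The helper class name is made up for the example.
{code:java}
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class RetryPolicyTuning {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();

    // Each "<duration>, <count>" pair means <count> retries with roughly
    // <duration> of sleep between them: 5 retries at ~1s, then 5 at ~10s,
    // 5 at ~15s, 5 at ~20s, 5 at ~25s and finally 10 at ~60s.
    conf.set("hdds.ratis.client.multilinear.random.retry.policy",
        "1s, 5, 10s, 5, 15s, 5, 20s, 5, 25s, 5, 60s, 10");

    // Pass 'conf' to the Ozone client that is being created.
  }
}
{code}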
> Write object when met exception can be slower than before
> ---------------------------------------------------------
>
> Key: HDDS-3994
> URL: https://issues.apache.org/jira/browse/HDDS-3994
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Client
> Affects Versions: 0.6.0
> Reporter: maobaolong
> Assignee: maobaolong
> Priority: Major
> Labels: pull-request-available
>
> After HDDS-3350, the retry policy changed, and client write performance became lower than before.
>
> With HDDS-3350 applied, I restored the method RatisHelper#createRetryPolicy to the previous commit's version, and it works well.
>
> The previous implementation is:
>
> {code:java}
> static RetryPolicy createRetryPolicy(ConfigurationSource conf) {
>   int maxRetryCount =
>       conf.getInt(OzoneConfigKeys.DFS_RATIS_CLIENT_REQUEST_MAX_RETRIES_KEY,
>           OzoneConfigKeys.DFS_RATIS_CLIENT_REQUEST_MAX_RETRIES_DEFAULT);
>   long retryInterval = conf.getTimeDuration(
>       OzoneConfigKeys.DFS_RATIS_CLIENT_REQUEST_RETRY_INTERVAL_KEY,
>       OzoneConfigKeys.DFS_RATIS_CLIENT_REQUEST_RETRY_INTERVAL_DEFAULT
>           .toIntExact(TimeUnit.MILLISECONDS),
>       TimeUnit.MILLISECONDS);
>   TimeDuration sleepDuration =
>       TimeDuration.valueOf(retryInterval, TimeUnit.MILLISECONDS);
>   RetryPolicy retryPolicy = RetryPolicies
>       .retryUpToMaximumCountWithFixedSleep(maxRetryCount, sleepDuration);
>   return retryPolicy;
> }
> {code}
> When I switch the log level to TRACE, I see the following logs while using HDDS-3350:
> 2020-07-21 12:56:11,822 [grpc-default-executor-5] TRACE impl.OrderedAsync: client-6F623ADF656D: Failed* RaftClientRequest:client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F, cid=2876, seq=1*, Watch(0), null
> java.util.concurrent.CompletionException: org.apache.ratis.protocol.LeaderNotReadyException: 207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F is in LEADER state but not ready yet.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.completeReplyExceptionally(GrpcClientProtocolClient.java:358)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$000(GrpcClientProtocolClient.java:264)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:283)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:269)
>     at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:436)
>     at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:658)
>     at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:643)
>     at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
>     at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.LeaderNotReadyException: 207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F is in LEADER state but not ready yet.
>     at org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:281)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:274)
>     ... 9 more
> 2020-07-21 12:56:11,822 [grpc-default-executor-5] DEBUG impl.OrderedAsync: schedule* attempt #1 with sleep 608ms and policy RequestTypeDependentRetryPolicy{WRITE->org.apache.ratis.retry.ExceptionDependentRetry@5685c509, WATCH->org.apache.ratis.retry.ExceptionDependentRetry@5572f5cd} for RaftClientRequest:client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F, cid=2876, seq=1*, Watch(0), null
> 2020-07-21 12:56:18,273 [java.util.concurrent.ThreadPoolExecutor$Worker@157a0c99[State = -1, empty queue]] DEBUG impl.OrderedAsync: schedule* attempt #35 with sleep 4606ms and policy RequestTypeDependentRetryPolicy{WRITE->org.apache.ratis.retry.ExceptionDependentRetry@19a0a59, WATCH->org.apache.ratis.retry.ExceptionDependentRetry@41afac0b} for RaftClientRequest:client-F768C93F1755->207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F, cid=2034, seq=1*, Watch(0), null
> 2020-07-21 12:56:18,820 [java.util.concurrent.ThreadPoolExecutor$Worker@64953e21[State = -1, empty queue]] DEBUG impl.OrderedAsync: client-6F623ADF656D: send* RaftClientRequest:client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F, cid=2876, seq=1*, Watch(0), null
> 2020-07-21 12:56:18,821 [java.util.concurrent.ThreadPoolExecutor$Worker@64953e21[State = -1, empty queue]] TRACE impl.OrderedAsync: client-6F623ADF656D: Failed* RaftClientRequest:client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5@group-83A28012848F, cid=2876, seq=1*, Watch(0), null
> java.util.concurrent.CompletionException: org.apache.ratis.protocol.AlreadyClosedException: client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5 is closed.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>     at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:614)
>     at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1983)
>     at org.apache.ratis.client.impl.OrderedAsync.sendRequest(OrderedAsync.java:234)
>     at org.apache.ratis.client.impl.OrderedAsync.sendRequestWithRetry(OrderedAsync.java:187)
>     at org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:278)
>     at org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:294)
>     at org.apache.ratis.client.impl.OrderedAsync.lambda$scheduleWithTimeout$7(OrderedAsync.java:220)
>     at org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:141)
>     at org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:155)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:79)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.AlreadyClosedException: client-6F623ADF656D->207b98d9-ad64-45a8-940f-504b514feff5 is closed.
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.onNext(GrpcClientProtocolClient.java:313)
>     at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsync(GrpcClientRpc.java:68)
>     at org.apache.ratis.client.impl.OrderedAsync.sendRequest(OrderedAsync.java:233)
>     ... 15 more
--
This message was sent by Atlassian Jira
(v8.3.4#803005)