[
https://issues.apache.org/jira/browse/RATIS-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849603#comment-17849603
]
Ivan Andika edited comment on RATIS-2089 at 5/27/24 2:42 AM:
-------------------------------------------------------------
Thank you for checking this out, glad to see that this might be useful.
I thought of this improvement while analyzing the Ratis block write process and
the upper-bound latency of a block write when there is a failure (e.g. one
datanode is down/unreachable). We found three relevant watch configurations
that might be set too high:
* Client-side watch timeout: hdds.ratis.raft.client.rpc.watch.request.timeout
** Ratis equivalent: raft.client.rpc.watch.request.timeout
** Ratis default: 10s
** Ozone default: 180s
* Server-side watch timeout (This is the one that throws
NotReplicatedException): hdds.ratis.raft.server.watch.timeout
** Ratis equivalent: raft.server.watch.timeout
** Ratis default: 10s
** Ozone default: 180s
* Client-side watch retry timeout: hdds.ratis.client.request.watch.timeout
** No Ratis equivalent. This is used in RequestTypeDependentRetryPolicyCreator
to decide whether to throw RaftRetryFailureException.
** I don't think this configuration comes into play often, since the WATCH
request retry policy is RetryPolicies.noRetry(): NotReplicatedException has the
NO_RETRY policy (only ResourceUnavailableException will be retried).
** Ozone default: 3m
In Ozone, all three have the same value (180s / 3m). Most likely the
client-side watch timeout will always be the one that fires, since it starts
first. As a result, NotReplicatedException may never be observed by the client,
because the client times out before the server does. I believe we need to
reduce the server-side watch timeout significantly (e.g. to 30s or 10s) so that
NotReplicatedException is thrown sooner. Only then will the client have a good
chance of catching NotReplicatedException and using its CommitInfoProto.
[~smeng] May I know whether your Ozone cluster changed these configurations and
whether you saw any latency improvements?
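For illustration, here is a minimal sketch of tuning the two Ratis-side
timeouts so that the server-side watch timeout fires before the client-side
one. The setter names are assumed from the usual Ratis ConfigKeys pattern; the
underlying keys are the ones listed above.
{code:java}
// Minimal sketch, not a tested patch. Assumes the standard Ratis ConfigKeys
// setters for raft.server.watch.timeout and raft.client.rpc.watch.request.timeout.
import java.util.concurrent.TimeUnit;

import org.apache.ratis.client.RaftClientConfigKeys;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.TimeDuration;

public final class WatchTimeoutTuning {
  public static RaftProperties tuneWatchTimeouts() {
    final RaftProperties properties = new RaftProperties();
    // raft.server.watch.timeout: lower it from Ozone's 180s so the server fails
    // the ALL_COMMITTED watch with NotReplicatedException quickly.
    RaftServerConfigKeys.Watch.setTimeout(
        properties, TimeDuration.valueOf(10, TimeUnit.SECONDS));
    // raft.client.rpc.watch.request.timeout: keep it above the server-side
    // timeout so the server's reply (and any CommitInfoProto it carries)
    // reaches the client before the client gives up.
    RaftClientConfigKeys.Rpc.setWatchRequestTimeout(
        properties, TimeDuration.valueOf(30, TimeUnit.SECONDS));
    return properties;
  }
}
{code}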
I will try to come up with a simple patch for RATIS-2089 soon. Will take a look
at HDDS-10108.
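As a very rough sketch of the direction (the getCommitInfos() accessor on
NotReplicatedException is hypothetical here; adding something like it is
exactly what RATIS-2089 proposes), the client-side handling could then look
roughly like this:
{code:java}
// Hedged sketch only: NotReplicatedException#getCommitInfos() does not exist
// yet; adding such an accessor is the RATIS-2089 proposal. Exception
// unwrapping is simplified.
import java.util.Collection;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.proto.RaftProtos.CommitInfoProto;
import org.apache.ratis.proto.RaftProtos.ReplicationLevel;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.exceptions.NotReplicatedException;

public final class WatchAllCommittedSketch {
  static CompletableFuture<RaftClientReply> watchAllCommitted(RaftClient client, long index) {
    return client.async()
        .watch(index, ReplicationLevel.ALL_COMMITTED)
        .exceptionally(t -> {
          final Throwable cause = t.getCause() != null ? t.getCause() : t;
          if (cause instanceof NotReplicatedException) {
            // Hypothetical accessor proposed by RATIS-2089: use the commit info
            // carried by the exception instead of sending a second
            // MAJORITY_COMMITTED watch just to learn the per-server commit state.
            final Collection<CommitInfoProto> commitInfos =
                ((NotReplicatedException) cause).getCommitInfos();
            // ... remove the failed datanodes from commitInfoMap using commitInfos ...
          }
          throw new CompletionException(cause);
        });
  }
}
{code}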
[~szetszwo] [~smeng] Also, may I ask what the reason was for the ALL_COMMITTED
watch requests on the block write in the first place? Alternatively, can we
release the client buffer immediately after a successful WriteChunk/PutBlock
request? My understanding is that a reply from the Ratis server already implies
MAJORITY_COMMITTED, so there should be no need to keep the buffer, since the
data will be replicated to all the datanode Ratis servers eventually. Is the
ALL_COMMITTED watch there to ensure that when the client (XceiverClientGrpc)
reads from any datanode, all the data has already been replicated to all the
datanodes, thereby preventing BCSID_MISMATCH during BlockManagerImpl#getBlock?
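To make that alternative concrete, here is a bare sketch (not actual Ozone
code; releaseBufferUpTo is a hypothetical placeholder for whatever the client's
buffer cleanup does today): release the buffer as soon as the write reply comes
back, instead of waiting for an ALL_COMMITTED watch.
{code:java}
// Hedged sketch of the alternative raised above, not actual Ozone code.
// releaseBufferUpTo(...) is a hypothetical placeholder for the buffer cleanup.
import java.util.concurrent.CompletableFuture;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

public final class ReleaseOnMajoritySketch {
  static CompletableFuture<RaftClientReply> writeAndRelease(RaftClient client, Message putBlock) {
    return client.async()
        .send(putBlock)
        .thenApply(reply -> {
          if (reply.isSuccess()) {
            // Per the reasoning above, a successful reply already implies the
            // entry is majority-committed. The open question is whether skipping
            // the ALL_COMMITTED watch risks BCSID_MISMATCH when a lagging
            // datanode is read before it catches up.
            releaseBufferUpTo(reply.getLogIndex());  // hypothetical cleanup hook
          }
          return reply;
        });
  }

  private static void releaseBufferUpTo(long logIndex) {
    // Placeholder for the client-side buffer release (e.g. Ozone's CommitWatcher).
  }
}
{code}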
> Add CommitInfoProto in NotReplicatedException
> ---------------------------------------------
>
> Key: RATIS-2089
> URL: https://issues.apache.org/jira/browse/RATIS-2089
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Minor
>
> In Ozone's XceiverClientRatis#watchForCommit, there are two watch commit
> requests with different ReplicationLevels:
> # Watch for ALL_COMMITTED
> # Watch for MAJORITY_COMMITTED (if the previous watch threw an exception)
> Based on the second watch request, the client will remove the failed
> datanodes' UUIDs from the commitInfoMap.
> The second watch might not be necessary, since the entries in
> AbstractCommitWatcher.commitIndexMap imply that the PutBlock request has
> already been committed to a majority of the servers. From my understanding,
> the second MAJORITY_COMMITTED watch only serves to gather the information
> needed to remove entries from commitInfoMap.
> If the first watch fails with NotReplicatedException, we might be able to
> remove the need for a second watch request. Since NotReplicatedException is a
> Raft server exception, we can include the CommitInfoProtos in the
> NotReplicatedException. The client can then use these CommitInfoProtos to
> remove the entries from commitInfoMap without sending another WATCH request.
> This CommitInfoProto is returned for every RaftClientReply
> (RaftClientReply.commitInfos), but if there is an exception, it seems the
> RaftClientReply is not accessible to the client.
> However, if the exception is a client-side exception (e.g. due to the Raft
> client watch timeout configuration), the client might have no choice but to
> send another watch request.
> So in this patch, I propose to include CommitInfoProto in
> NotReplicatedException.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)