[
https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056750#comment-18056750
]
Tak-Lon (Stephen) Wu commented on RATIS-2389:
---------------------------------------------
Hi [~tanxinyu], thanks for the review and comments. Your observation is
correct; let me respond inline below.
{quote}As a possible follow-up improvement, it might be worth considering
modeling retries on the client side as child spans, and capturing more detailed
critical paths in the server-side read and write workflows
{quote}
I am still getting up to speed on Ratis, so I want to clarify the scope of the
retries you mentioned. Are you referring to {{sendRequestWithRetry}} (used by
{{{}AsyncImpl#sendReadOnlyUnordered{}}}) or to the operations governed by
RetryPolicies? If so, should we address this as a follow-up task, time
permitting?
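To make the idea concrete, here is a minimal sketch of what "retries as child spans" could look like: one parent span for the logical request and one child span per attempt. It uses a hypothetical in-memory {{SketchSpan}} class and a hypothetical {{callWithRetry}} helper so that it runs with the JDK alone; neither is a Ratis or OpenTelemetry type, and a real implementation would presumably create OpenTelemetry spans and follow the configured retry policy instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class RetrySpanSketch {
  /** Hypothetical stand-in for a tracing span; not an OpenTelemetry or Ratis type. */
  static final class SketchSpan {
    final String name;
    final SketchSpan parent; // null for the root span
    final List<SketchSpan> children = new ArrayList<>();
    String status = "UNSET";

    SketchSpan(String name, SketchSpan parent) {
      this.name = name;
      this.parent = parent;
      if (parent != null) {
        parent.children.add(this); // link the attempt to the logical request
      }
    }
  }

  /** Runs the call up to maxAttempts times, recording one child span per attempt. */
  static <T> T callWithRetry(SketchSpan parent, int maxAttempts, Supplier<T> call) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final SketchSpan child = new SketchSpan("attempt-" + attempt, parent);
      try {
        final T result = call.get();
        child.status = "OK";
        parent.status = "OK";
        return result;
      } catch (RuntimeException e) {
        child.status = "ERROR"; // the failed attempt stays visible in the trace
        last = e;
      }
    }
    parent.status = "ERROR"; // every attempt failed
    throw last;
  }
}
```

With this shape, a request that succeeds on its third attempt shows up as one parent span with two failed children and one successful child, which is the kind of per-attempt visibility the suggestion seems to be asking for.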
Regarding server-side tracing, my current thinking is to implement these as
*INTERNAL* span kinds (possibly also as a follow-up). This would provide
much-needed visibility into log appenders and replication, snapshots, internal
state transitions, etc.
Ultimately, we will rely on user and community feedback to identify further
checkpoints or improvements needed for the tracing pipeline as it evolves.
Thanks again for your suggestion.
{quote}Regarding trace data delivery and configuration, it may be helpful if
the documentation
{quote}
The sub-task RATIS-2396 should partially cover the documentation request, and
this is a good reminder for me to include some information about how
OpenTelemetry traces are collected and sent.
To answer your concern briefly: the collected trace data is sent
asynchronously, and the sample ratio, batch size, and target host should all be
configurable.
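As a rough illustration of those knobs, the sketch below maps them onto the system properties read by the OpenTelemetry Java SDK autoconfigure module. Whether Ratis will wire the SDK through autoconfigure is an assumption on my part, and the {{TracingConfigSketch}} class with its example values is hypothetical; only the {{otel.*}} property names come from the OpenTelemetry SDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TracingConfigSketch {
  /**
   * Builds system properties for the OpenTelemetry SDK autoconfigure module.
   * Assumption: the SDK is initialized via autoconfigure; Ratis may choose
   * its own configuration keys instead.
   */
  static Map<String, String> tracingProperties(double sampleRatio, int batchSize, String endpoint) {
    final Map<String, String> props = new LinkedHashMap<>();
    // Export spans over OTLP to the target host (for example, a collector or Jaeger).
    props.put("otel.traces.exporter", "otlp");
    props.put("otel.exporter.otlp.endpoint", endpoint);
    // Sample a fraction of new traces; child spans follow the parent's decision.
    props.put("otel.traces.sampler", "parentbased_traceidratio");
    props.put("otel.traces.sampler.arg", Double.toString(sampleRatio));
    // The batch span processor exports in the background, in batches of this size.
    props.put("otel.bsp.max.export.batch.size", Integer.toString(batchSize));
    return props;
  }

  public static void main(String[] args) {
    // Example values only: 10% sampling, batches of 512 spans, local OTLP endpoint.
    tracingProperties(0.1, 512, "http://localhost:4317").forEach(System::setProperty);
  }
}
```

The same settings could equally be passed as {{-D}} JVM flags; the point is only that sampling, batching, and the export target are independent settings.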
{quote}would it be possible to visualize the tracing results using Jaeger and
include some screenshots or links (for example, showing the read and write
paths) in the PR or related documentation?
{quote}
If you look at the PoC branch [1], we traced AsyncImpl#send, which supports
`sendReadOnly`, `sendReadAfterWrite`, etc. The PoC result was captured by
running the FileStore client, and [^PoC-result-span-detail.png], which was
attached when the JIRA was created, is a visualization from the Jaeger UI (as
you requested). For example, the span
`RW/org.apache.ratis.protocol.RaftClientRequest/sendRequestAsync` is a "Write"
request, where `RW` is the `TypeCase` used in the code for `Write`.
{code:java}
CompletableFuture<RaftClientReply> send(
    RaftClientRequest.Type type, Message message, RaftPeerId server) {
  final Supplier<Span> spanSupplier = new OperationSpanBuilder(server)
      .setOperationName("AsyncImpl::send")
      .setOperationType(type);
  return TraceUtils.tracedFuture(
      () -> client.getOrderedAsync().send(type, message, server),
      spanSupplier);
}
{code}
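For readers unfamiliar with the pattern, here is a minimal sketch of what a helper like {{tracedFuture}} typically does: obtain a span from the supplier, start the asynchronous action, and end the span only when the returned future completes. This is not the PoC code. The {{SketchSpan}} class is a hypothetical stand-in so the example runs with the JDK alone, and the real helper would operate on OpenTelemetry spans.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class TracedFutureSketch {
  /** Hypothetical stand-in for a tracing span; not an OpenTelemetry or Ratis type. */
  static final class SketchSpan {
    final String name;
    volatile String status = "UNSET";
    volatile boolean ended = false;

    SketchSpan(String name) {
      this.name = name;
    }

    void end(String finalStatus) {
      status = finalStatus;
      ended = true;
    }
  }

  /** Starts the action and ends the span when the resulting future completes. */
  static <T> CompletableFuture<T> tracedFuture(
      Supplier<CompletableFuture<T>> action, Supplier<SketchSpan> spanSupplier) {
    final SketchSpan span = spanSupplier.get();
    CompletableFuture<T> future;
    try {
      future = action.get();
    } catch (RuntimeException e) {
      span.end("ERROR"); // the action failed before returning a future
      throw e;
    }
    // The returned future completes only after the span has been ended.
    return future.whenComplete((reply, error) -> span.end(error == null ? "OK" : "ERROR"));
  }
}
```

The span therefore covers the full asynchronous lifetime of the request rather than just the time spent submitting it.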
I will add those screenshots to the documentation after posting this comment.
Reference
1. [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Implementing Opentelemetry Tracing in Apache Ratis
> --------------------------------------------------
>
> Key: RATIS-2389
> URL: https://issues.apache.org/jira/browse/RATIS-2389
> Project: Ratis
> Issue Type: New Feature
> Components: client, server
> Affects Versions: 3.3.0
> Reporter: Tak-Lon (Stephen) Wu
> Assignee: Tak-Lon (Stephen) Wu
> Priority: Minor
> Attachments: PoC-result-span-detail.png, PoC-result.png
>
>
> This proposal outlines the addition of OpenTelemetry support to Ratis. By
> instrumenting the full client-side request path, we can empower users and
> maintainers with the granular data necessary for both long-term performance
> optimization and proactive daily monitoring.
> * 1-pager proposal:
> [https://docs.google.com/document/d/1UKGVqOzkAXqUAJxOz1RHq6fIiO3xqV57eIqi-f9qdE4/edit?tab=t.0#heading=h.5a3u31wlm0n]
> * PoC: [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Subtasks
> * Define the Metadata Field: Modify RaftRpcMessage.proto to include an
> optional SpanContext field.
> * Add TraceUtil: Land the utility class in ratis-common based on the code
> you see in HBase.
> * Create the client span: Introduce the span supplier and CLIENT span hook.
> * Instrument GRPC on the Server: Start with the GRPC module as it is the
> most common transport. Instrument the onNext methods (or within the caller)
> to start/stop spans.
> * Come up with the user guide as part of the release.
> Reference
> 1. HBase Tracing with Opentelemetry,
> [HBASE-22120|https://issues.apache.org/jira/browse/HBASE-22120]