[
https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056669#comment-18056669
]
Xinyu Tan commented on RATIS-2389:
----------------------------------
[~taklwu] Thank you for the proposal — it looks very promising overall.
At the moment, the tracing seems to focus on two high-level spans (client and
server). As a possible follow-up improvement, it might be worth considering
modeling retries on the client side as child spans, and capturing more detailed
critical paths in the server-side read and write workflows. This could help
provide deeper insights into potential read/write bottlenecks and guide future
optimizations.
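To make the retry suggestion concrete, here is a minimal, dependency-free
sketch of how each client attempt could be recorded as a child of the single
client span. The Span class and callWithRetry helper are hypothetical stand-ins
written for illustration; they are not the OpenTelemetry API and not code from
the PoC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class RetrySpanSketch {
  /** Hypothetical stand-in for a tracing span; not the OpenTelemetry API. */
  static final class Span {
    final String name;
    final Span parent;
    final List<Span> children = new ArrayList<>();
    String status = "UNSET";

    Span(String name, Span parent) {
      this.name = name;
      this.parent = parent;
      if (parent != null) {
        parent.children.add(this); // link the attempt under the client span
      }
    }
  }

  /** Runs the call up to maxAttempts times, recording one child span per attempt. */
  static <T> T callWithRetry(Span clientSpan, int maxAttempts, Supplier<T> call) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Span attemptSpan = new Span("attempt-" + attempt, clientSpan);
      try {
        T result = call.get();
        attemptSpan.status = "OK";
        clientSpan.status = "OK";
        return result;
      } catch (RuntimeException e) {
        attemptSpan.status = "ERROR"; // the failed attempt stays visible in the trace
        last = e;
      }
    }
    clientSpan.status = "ERROR";
    throw last;
  }

  public static void main(String[] args) {
    Span client = new Span("client-request", null);
    int[] calls = {0};
    // Simulated request that fails twice before succeeding.
    String reply = callWithRetry(client, 3, () -> {
      if (++calls[0] < 3) {
        throw new IllegalStateException("simulated failure");
      }
      return "ok";
    });
    System.out.println("reply=" + reply);
    for (Span child : client.children) {
      System.out.println(child.name + " " + child.status);
    }
  }
}
```

With this shape, a trace viewer would show the failed attempts and their
durations under the client span, rather than a single long client span.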
Regarding trace data delivery and configuration, it would help if the
documentation clarified how these aspects are defined and configured: whether
trace data is sent synchronously or asynchronously, the batching strategy and
batch size, the target host, and whether RPC compression can be enabled.
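As an illustration of the knobs in question, the sketch below lists the system
properties that the OpenTelemetry Java SDK autoconfigure module reads for
asynchronous batch export over OTLP. It assumes the PoC would rely on that
module, which the proposal does not confirm. The values are illustrative, and
the TraceExportConfigSketch class is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TraceExportConfigSketch {
  /**
   * Property names follow the OpenTelemetry SDK autoconfigure convention.
   * Whether Ratis would use autoconfigure at all is an assumption.
   */
  static Map<String, String> exportProperties() {
    Map<String, String> p = new LinkedHashMap<>();
    p.put("otel.traces.exporter", "otlp");                         // export spans via OTLP
    p.put("otel.exporter.otlp.endpoint", "http://localhost:4317"); // target host (illustrative)
    p.put("otel.exporter.otlp.compression", "gzip");               // compress export requests
    p.put("otel.bsp.schedule.delay", "5000");                      // batch flush interval, ms
    p.put("otel.bsp.max.export.batch.size", "512");                // spans per export batch
    p.put("otel.bsp.max.queue.size", "2048");                      // buffered spans before dropping
    return p;
  }

  public static void main(String[] args) {
    // Apply as JVM system properties before the SDK is initialized.
    exportProperties().forEach(System::setProperty);
    exportProperties().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

The otel.bsp.* properties configure the batch span processor, which queues
finished spans and exports them from a background thread, so with settings
like these the export does not block the request path.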
For the current PoC, before moving the PR into a formal review state, would it
be possible to visualize the tracing results using Jaeger and include some
screenshots or links (for example, showing the read and write paths) in the PR
or related documentation? This would give reviewers a clearer and more
concrete understanding of the PoC's behavior and effectiveness.
> Implementing Opentelemetry Tracing in Apache Ratis
> --------------------------------------------------
>
> Key: RATIS-2389
> URL: https://issues.apache.org/jira/browse/RATIS-2389
> Project: Ratis
> Issue Type: New Feature
> Components: client, server
> Affects Versions: 3.3.0
> Reporter: Tak-Lon (Stephen) Wu
> Assignee: Tak-Lon (Stephen) Wu
> Priority: Minor
> Attachments: PoC-result-span-detail.png, PoC-result.png
>
>
> This proposal outlines the addition of OpenTelemetry support to Ratis. By
> instrumenting the full client-side request path, we can empower users and
> maintainers with the granular data necessary for both long-term performance
> optimization and proactive daily monitoring.
> * 1-pager proposal:
> [https://docs.google.com/document/d/1UKGVqOzkAXqUAJxOz1RHq6fIiO3xqV57eIqi-f9qdE4/edit?tab=t.0#heading=h.5a3u31wlm0n]
> * PoC: [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Subtasks
> * Define the Metadata Field: Modify RaftRpcMessage.proto to include an
> optional SpanContext field.
> * Add TraceUtil: Land the utility class in ratis-common, modeled on the
> equivalent code in HBase.
> * Create the client span: Introduce the span supplier and CLIENT span hook.
> * Instrument gRPC on the Server: Start with the gRPC module as it is the
> most common transport. Instrument the onNext methods (or within the caller)
> to start/stop spans.
> * Write a user guide as part of the release.
> Reference
> 1. HBase Tracing with OpenTelemetry,
> [HBASE-22120|https://issues.apache.org/jira/browse/HBASE-22120]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)