[ 
https://issues.apache.org/jira/browse/RATIS-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056750#comment-18056750
 ] 

Tak-Lon (Stephen) Wu commented on RATIS-2389:
---------------------------------------------

Hi [~tanxinyu], thanks for your review and comments. Your observation is 
correct; let me respond inline below.
{quote}As a possible follow-up improvement, it might be worth considering 
modeling retries on the client side as child spans, and capturing more detailed 
critical paths in the server-side read and write workflows
{quote}
I am still getting up to speed on Ratis, so I want to clarify the scope of the 
retries you mentioned. Are we referring to {{sendRequestWithRetry}} (used by 
{{AsyncImpl#sendReadOnlyUnordered}}) or the operations governed by 
RetryPolicies? If so, should we address this as a follow-up task if time 
permits?
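
For illustration only, here is a minimal sketch of what "retries as child 
spans" could look like: one operation-level span with one child span per 
attempt. It uses a toy in-memory span type rather than the OpenTelemetry API, 
and every name in it ({{SketchSpan}}, {{RetrySpanSketch}}, {{callWithRetry}}) is 
hypothetical and not taken from Ratis or the PoC.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy stand-in for a tracing span; NOT the OpenTelemetry API. */
final class SketchSpan {
  final String name;
  final SketchSpan parent;  // null for the operation-level (root) span
  final List<SketchSpan> children = new ArrayList<>();
  String status = "UNSET";
  boolean ended;

  SketchSpan(String name, SketchSpan parent) {
    this.name = name;
    this.parent = parent;
    if (parent != null) {
      parent.children.add(this);  // register this span under its parent
    }
  }

  void end(String finalStatus) {
    this.status = finalStatus;
    this.ended = true;
  }
}

final class RetrySpanSketch {
  /** Runs the call up to maxAttempts times, recording one child span per attempt. */
  static <T> T callWithRetry(SketchSpan operation, int maxAttempts, Supplier<T> call) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      final SketchSpan attemptSpan = new SketchSpan("attempt-" + attempt, operation);
      try {
        final T result = call.get();
        attemptSpan.end("OK");
        operation.end("OK");
        return result;
      } catch (RuntimeException e) {
        attemptSpan.end("ERROR");  // the failed attempt stays visible in the trace
        last = e;
      }
    }
    operation.end("ERROR");
    throw last;
  }
}
{code}
With this shape, a request that succeeds on its third attempt would show up as 
one parent span with two failed children and one successful child, which makes 
the cost of retries visible in the trace.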

Regarding server-side tracing, my current thinking is to implement these as 
*INTERNAL* span kinds (possibly also as a follow-up). This would provide 
much-needed visibility into log appenders and replication, snapshots, internal 
state transitions, etc.

Ultimately, we will rely on user and community feedback to identify further 
checkpoints or improvements needed for the tracing pipeline as it evolves. 
Thanks again for your suggestion.
{quote}Regarding trace data delivery and configuration, it may be helpful if 
the documentation
{quote}
The sub-task RATIS-2396 should partially cover the documentation request, and 
this is a good reminder for me to include some information about how 
OpenTelemetry trace data is collected and sent. 

To answer your concern briefly: the collected trace data is sent 
asynchronously, and the sample ratio, batch size, and target host could be 
made configurable.
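
As a hedged illustration of what such settings might look like, the sketch 
below collects the standard OpenTelemetry SDK autoconfigure property names for 
the exporter endpoint, sampling ratio, and batch span processor. Whether Ratis 
will read these exact properties is an assumption; the class name 
{{OtelConfigSketch}} and the values are placeholders.
{code:java}
import java.util.Properties;

final class OtelConfigSketch {
  /** Example values only; the keys are OpenTelemetry SDK autoconfigure names. */
  static Properties exampleTracingProperties() {
    final Properties p = new Properties();
    // Where spans are sent (target host); assumes an OTLP collector on localhost.
    p.setProperty("otel.traces.exporter", "otlp");
    p.setProperty("otel.exporter.otlp.endpoint", "http://localhost:4317");
    // Sample ratio: keep roughly 10% of new traces, honoring the parent's decision.
    p.setProperty("otel.traces.sampler", "parentbased_traceidratio");
    p.setProperty("otel.traces.sampler.arg", "0.1");
    // Asynchronous batching: maximum spans per export and export interval in ms.
    p.setProperty("otel.bsp.max.export.batch.size", "512");
    p.setProperty("otel.bsp.schedule.delay", "5000");
    return p;
  }
}
{code}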
{quote}would it be possible to visualize the tracing results using Jaeger and 
include some screenshots or links (for example, showing the read and write 
paths) in the PR or related documentation?
{quote}
If you look at [1], we traced {{AsyncImpl#send}}, which supports 
{{sendReadOnly}}, {{sendReadAfterWrite}}, etc. The PoC result was captured by 
running the FileStore client, and [^PoC-result-span-detail.png], visualized in 
the Jaeger UI (as you requested), was attached when the JIRA was created. For 
example, {{RW/org.apache.ratis.protocol.RaftClientRequest/sendRequestAsync}} is 
a "Write" request, where {{RW}} is the {{TypeCase}} in the code for {{Write}}.
{code:java}
  CompletableFuture<RaftClientReply> send(
      RaftClientRequest.Type type, Message message, RaftPeerId server) {
    final Supplier<Span> spanSupplier = new OperationSpanBuilder(server)
        .setOperationName("AsyncImpl::send")
        .setOperationType(type);
    return TraceUtils.tracedFuture(
        () -> client.getOrderedAsync().send(type, message, server),
        spanSupplier);
  }
{code}
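
To make the pattern in the snippet above concrete, here is a rough, 
self-contained sketch of how a traced-future helper could behave: start a span, 
invoke the asynchronous action, and end the span only when the returned future 
completes. This is not the PoC's actual {{TraceUtils}}; {{ToySpan}} and 
{{TracedFutureSketch}} are hypothetical stand-ins that avoid any OpenTelemetry 
dependency.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Toy stand-in for a span; NOT the OpenTelemetry API or the PoC's classes. */
final class ToySpan {
  volatile boolean ended;
  volatile Throwable error;

  void recordError(Throwable t) {
    this.error = t;
  }

  void end() {
    this.ended = true;
  }
}

final class TracedFutureSketch {
  /** Starts a span, runs the async action, and ends the span on completion. */
  static <T> CompletableFuture<T> tracedFuture(
      Supplier<CompletableFuture<T>> action, Supplier<ToySpan> spanSupplier) {
    final ToySpan span = spanSupplier.get();
    final CompletableFuture<T> future;
    try {
      future = action.get();
    } catch (RuntimeException e) {
      // The action failed before producing a future: close the span immediately.
      span.recordError(e);
      span.end();
      throw e;
    }
    // End the span when the future completes, whether normally or exceptionally.
    return future.whenComplete((result, error) -> {
      if (error != null) {
        span.recordError(error);
      }
      span.end();
    });
  }
}
{code}
The key design point is that the span's lifetime follows the future rather than 
the calling thread, so the recorded duration covers the whole asynchronous 
request instead of just the time taken to submit it.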

I will add those pictures back to the documentation after this comment.

Reference

1. [https://github.com/taklwu/ratis/tree/opentelemetry0129]

> Implementing Opentelemetry Tracing in Apache Ratis
> --------------------------------------------------
>
>                 Key: RATIS-2389
>                 URL: https://issues.apache.org/jira/browse/RATIS-2389
>             Project: Ratis
>          Issue Type: New Feature
>          Components: client, server
>    Affects Versions: 3.3.0
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Minor
>         Attachments: PoC-result-span-detail.png, PoC-result.png
>
>
> This proposal outlines the addition of OpenTelemetry support to Ratis. By 
> instrumenting the full client-side request path, we can empower users and 
> maintainers with the granular data necessary for both long-term performance 
> optimization and proactive daily monitoring.
>  * 1-pager proposal: 
> [https://docs.google.com/document/d/1UKGVqOzkAXqUAJxOz1RHq6fIiO3xqV57eIqi-f9qdE4/edit?tab=t.0#heading=h.5a3u31wlm0n]
>  * PoC: [https://github.com/taklwu/ratis/tree/opentelemetry0129]
> Subtasks
>  * Define the Metadata Field: Modify RaftRpcMessage.proto to include an 
> optional SpanContext field.
>  * Add TraceUtil: Land the utility class in ratis-common based on the code 
> you see in HBase.
>  * Create the client span: Introduce the span supplier and CLIENT span hook.
>  * Instrument GRPC on the Server: Start with the GRPC module as it is the 
> most common transport. Instrument the onNext methods (or within the caller) 
> to start/stop spans.
>  * Come up with the user guide as part of the release. 
> Reference
> 1. HBase Tracing with Opentelemetry, 
> [HBASE-22120|https://issues.apache.org/jira/browse/HBASE-22120]



