GitHub user xintongsong added a comment to the discussion: [Discussion] Agent observability: tracing & evaluation beyond metrics + event log
Hi @addu390, +1 on this. Being able to reconstruct and visualize the trace of a single run is really helpful for observing and understanding agent behavior, and it's exactly the gap that neither metrics nor the event log fills today.

I'd suggest splitting this into two fairly independent problems:

- **Recording:** keep enough information in the event log to reconstruct the causal tree.
- **Reconstruction & visualization:** rebuild and render a single run from that information.

Once you split it this way, the overhead concern mostly lands on the first part, and I think that part is actually pretty light.

We already record every event in the event log. To reconstruct the causal tree, we basically just need one extra field per event: which action emitted the event. With that, the whole run can be rebuilt from the event log. The reverse edges (which actions an event triggered) don't even need to be recorded explicitly. They can be derived from the action trigger rules plus timestamps. So the increment on the recording side is small, and I think it can be kept separate from the concern about full-fidelity tracing being too heavy at streaming QPS.

The second part can be fully on-demand rather than always-on. We only run it when needed, e.g. local debugging, CI/eval, or staging. The rendering form doesn't have to be a tree either; PlantUML or something else would work too, and we can discuss that separately. As long as the recording side captures the necessary info, there's a lot of flexibility in what we do with it later.

GitHub link: https://github.com/apache/flink-agents/discussions/710#discussioncomment-17088983
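
To make the recording idea above a bit more concrete, here is a rough sketch of how a run could be rebuilt from an event log that carries the one extra field. Everything in it is an assumption for illustration, not the flink-agents API: the `Event` class, the `emitted_by` field name, the event and action names, and the shape of `rules` (event type to the actions it triggers) are all hypothetical. The attribution heuristic, which picks the most recent earlier event that could have triggered the emitting action, is also just one possible reading of "trigger rules plus timestamps".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One event-log record (hypothetical shape)."""
    event_id: str
    event_type: str
    timestamp: int
    # The single extra field: which action emitted this event.
    # None for the external input that starts the run.
    emitted_by: Optional[str] = None


def reconstruct(events, rules):
    """Rebuild the causal tree as {event_id: {action: [child_event_ids]}}.

    The event -> action edges are not read from the log; they are derived
    from `rules` (event type -> actions it triggers) plus timestamp order.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    tree = {
        e.event_id: {action: [] for action in rules.get(e.event_type, [])}
        for e in ordered
    }
    for i, event in enumerate(ordered):
        if event.emitted_by is None:
            continue
        # Assumed heuristic: attribute the event to the most recent earlier
        # event whose type triggers the emitting action.
        for parent in reversed(ordered[:i]):
            if event.emitted_by in rules.get(parent.event_type, []):
                tree[parent.event_id][event.emitted_by].append(event.event_id)
                break
    return tree


def render(tree, types, event_id, depth=0):
    """Render one run as indented text lines (a tree is just one option)."""
    lines = ["  " * depth + f"{types[event_id]} ({event_id})"]
    for action, children in tree[event_id].items():
        lines.append("  " * (depth + 1) + f"-> {action}")
        for child in children:
            lines.extend(render(tree, types, child, depth + 2))
    return lines


# Hypothetical trigger rules and a four-event log for a single run.
rules = {
    "input": ["plan"],
    "tool_request": ["call_tool"],
    "tool_response": ["plan"],
}
log = [
    Event("e1", "input", 1),
    Event("e2", "tool_request", 2, emitted_by="plan"),
    Event("e3", "tool_response", 3, emitted_by="call_tool"),
    Event("e4", "output", 4, emitted_by="plan"),
]

tree = reconstruct(log, rules)
types = {e.event_id: e.event_type for e in log}
print("\n".join(render(tree, types, "e1")))
```

Note that `reconstruct` and `render` only run when someone asks for a trace, so the always-on cost in this sketch is the single `emitted_by` field per event. If the same action can be in flight for several triggering events at once, timestamps alone may not disambiguate them, and the heuristic would need something stronger.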
