[GitHub] [arrow] joosthooz commented on pull request #34168: GH-33880: [C++] Improve I/O tracing

via GitHub Mon, 17 Jul 2023 07:54:49 -0700


joosthooz commented on PR #34168:
URL: https://github.com/apache/arrow/pull/34168#issuecomment-1638313332

@mapleFU I think this is ready for some feedback. This PR requires some
additional context. I've focused mostly on the Dataset writers and readers, and
I think that some of the "normal" readers/writers are still not yet
instrumented with spans. I think we should leave that out of scope. As noted in
the description above, this PR is a follow-up to a previous PR that changed the
structure of traces. Before, there were top-level spans for each graph node,
and each item of data would have a child span underneath each of those.
Here, the goal is to use spans' parent-child relation to show how morsels
flow through the graph:

![image](https://github.com/apache/arrow/assets/1442581/ae897152-7276-4223-92c7-0d19bae78fea)
These morsels are produced by the readers:

![image](https://github.com/apache/arrow/assets/1442581/384ca71b-e733-40c9-93d1-16aca3afd743)
But because of how the code is written, I have not been able to connect
these together. In the ideal world, there would be a connection between
ReadNextAsync -> MakeChunkedBatchGenerator -> ProcessMorsel. But that would
require more invasive code changes (passing along a span context with an
ExecBatch, for example).

![image](https://github.com/apache/arrow/assets/1442581/891ba9cd-c0b2-4536-9015-bfc23c5b1881)
What I'd still like to do is somehow add an attribute to certain spans whose
duration can be aggregated to get some sensible total (like CPU time spent on
the read side, each kernel, and the write side of things). And then I've been
working on a notebook to visualize a trace into a per-thread activity overview,
it would be nice if we can add something like that but I'm not sure how.
Looking forward to any feedback

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] joosthooz commented on pull request #34168: GH-33880: [C++] Improve I/O tracing

Reply via email to