joosthooz commented on PR #34168: URL: https://github.com/apache/arrow/pull/34168#issuecomment-1638313332
@mapleFU I think this is ready for some feedback. This PR requires some additional context. I've focused mostly on the Dataset writers and readers, and I think that some of the "normal" readers/writers are still not yet instrumented with spans. I think we should leave that out of scope. As noted in the description above, this PR is a follow-up to a previous PR that changed the structure of traces. Before, there were top-level spans for each graph node, and each item of data would have a child span underneath each of those. Here, the goal is to use spans' parent-child relation to show how morsels flow through the graph:  These morsels are produced by the readers:  But because of how the code is written, I have not been able to connect these together. In the ideal world, there would be a connection between ReadNextAsync -> MakeChunkedBatchGenerator -> ProcessMorsel. But that would require more invasive code changes (passing along a span context with an ExecBatch, for example).  What I'd still like to do is somehow add an attribute to certain spans whose duration can be aggregated to get some sensible total (like CPU time spent on the read side, each kernel, and the write side of things). And then I've been working on a notebook to visualize a trace into a per-thread activity overview, it would be nice if we can add something like that but I'm not sure how. Looking forward to any feedback -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
