westonpace commented on PR #33738:
URL: https://github.com/apache/arrow/pull/33738#issuecomment-1400423789
> Is there a way to go back to making most of these spans siblings again?
Yes. The recursion is probably somewhat accurate but I agree it makes it
harder to read. Having them as siblings makes sense too. I will revert back
to this understanding.
> Do we want to change the organization of the spans in this PR from having
1 span for each node in the graph, each having a span for every chunk of data
it processes (how it was before), to having a ProcessMorsel for each chunk of
data, each having a span for each node it traverses through?
Yes, I believe so. A more generic term than `ProcessMorsel` would be
"pipeline" or "plan fragment". Acero is (implicitly) a "plan fragment" engine
in that we have one task per batch per fragment. Thinking about it this way it
would be nice if we had something like the conceptual model:
* ProcessMorsel
* Filter
* Project
* Sink
But today it ends up being (because the last part of each node is to call
the downstream node):
* ProcessMorsel
* Filter
* Project
* Sink
Yet another case where the conceptual/logical understanding is different
than the physical understanding. Although, perhaps there is some merit for the
physical understanding as it mirrors reality more closely. Perhaps it depends
on the goal of the reader. If someone is trying to improve the threading and
execution of the plan itself they might want the physical model. If someone is
trying to focus on the performance of a single node they might want the logical
model. I'll leave that for a follow-up PR.
> I think I can help in a follow-up PR, especially for the dataset writer.
Great. Mentally, when I think of the dataset writer, I think there are two
parts. The first part should be the trailing part of the fragment/pipeline
that feeds the writer. In this first part we partition the batch, select the
appropriate file queues, and deposit the batches into the queues. There is
then a separate dedicated thread task to write each batch to the writer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]