westonpace commented on PR #33738:
URL: https://github.com/apache/arrow/pull/33738#issuecomment-1400423789

   >  Is there a way to go back to making most of these spans siblings again?
   
   Yes.  The recursion is probably somewhat accurate but I agree it makes it 
harder to read.  Having them as siblings makes sense too.  I will revert back 
to this understanding.
   
   >  Do we want to change the organization of the spans in this PR from having 
1 span for each node in the graph, each having a span for every chunk of data 
it processes (how it was before), to having a ProcessMorsel for each chunk of 
data, each having a span for each node it traverses through?
   
   Yes, I believe so.  A more generic term than `ProcessMorsel` would be 
"pipeline" or "plan fragment".  Acero is (implicitly) a "plan fragment" engine 
in that we have one task per batch per fragment.  Thinking about it this way it 
would be nice if we had something like the conceptual model:
   
    * ProcessMorsel
      * Filter
      * Project
      * Sink
   
   But today it ends up being (because the last part of each node is to call 
the downstream node):
   
    * ProcessMorsel
      * Filter
        * Project
          * Sink
   
   Yet another case where the conceptual/logical understanding is different 
than the physical understanding.  Although, perhaps there is some merit for the 
physical understanding as it mirrors reality more closely.  Perhaps it depends 
on the goal of the reader.  If someone is trying to improve the threading and 
execution of the plan itself they might want the physical model.  If someone is 
trying to focus on the performance of a single node they might want the logical 
model.  I'll leave that for a follow-up PR.
   
   >  I think I can help in a follow-up PR, especially for the dataset writer.
   
   Great.  Mentally, when I think of the dataset writer, I think there are two 
parts.  The first part should be the trailing part of the fragment/pipeline 
that feeds the writer.  In this first part we partition the batch, select the 
appropriate file queues, and deposit the batches into the queues.  There is 
then a separate dedicated thread task to write each batch to the writer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to