westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1120079201

   The second thing you will often see mentioned is the "morsel/batch" model. 
When reading data in, you often want to read it in fairly large blocks 
(counter-intuitively, these large blocks are called "morsels").  This can 
lead to an execution engine that looks roughly like:
   
   ```
   parallel for morsel in data_source:
     for operator in pipeline:
       morsel = operator(morsel)
     send_to_sink(morsel)
   ```
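   
   To make that concrete, here is a minimal runnable sketch of the per-morsel 
model, using Python threads purely for illustration.  The operators, the 
`data_source`, and the sink are made-up stand-ins, not the actual engine API:
   
   ```python
   from concurrent.futures import ThreadPoolExecutor

   # Illustrative stand-ins, not the engine's real API: two trivial
   # operators, eight large "morsels" of integers, and a printing sink.
   pipeline = [lambda rows: [r + 1 for r in rows],
               lambda rows: [r for r in rows if r % 2 == 0]]
   data_source = [list(range(1_000_000)) for _ in range(8)]
   send_to_sink = lambda rows: print(f"sank {len(rows)} rows")

   def process_morsel(morsel):
       # Every operator makes a full pass over the whole morsel, which is
       # typically far larger than the CPU caches.
       for operator in pipeline:
           morsel = operator(morsel)
       send_to_sink(morsel)

   with ThreadPoolExecutor() as pool:  # parallelism is across whole morsels
       list(pool.map(process_morsel, data_source))
   ```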
   
   However, since each operator makes a full pass over the same data, and 
morsels are typically larger than the CPU caches, this can be inefficient: by 
the time the second operator runs, the data has often already been evicted 
from cache.  The ideal approach is instead:
   
   ```
   for morsel in data_source:
     parallel for l2_batch in morsel:
       for operator in pipeline:
         l2_batch = operator(l2_batch)
       send_to_sink(l2_batch)
   ```
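   
   To see where the parallelism moves, here is a matching sketch of the 
morsel-then-batch model.  It reuses the same made-up stand-ins as above, and 
the `L2_BATCH` size and the `l2_batches` helper are hypothetical, chosen only 
for illustration:
   
   ```python
   from concurrent.futures import ThreadPoolExecutor

   L2_BATCH = 4_096  # rows per batch; in practice sized so a batch fits in L2

   pipeline = [lambda rows: [r + 1 for r in rows],
               lambda rows: [r for r in rows if r % 2 == 0]]
   data_source = [list(range(1_000_000)) for _ in range(8)]
   send_to_sink = lambda rows: None  # a real sink would buffer/forward results

   def l2_batches(morsel):
       # Hypothetical helper: slice a morsel into cache-sized chunks.
       for start in range(0, len(morsel), L2_BATCH):
           yield morsel[start:start + L2_BATCH]

   def process_batch(batch):
       # The whole operator chain runs while the batch is cache-resident.
       for operator in pipeline:
           batch = operator(batch)
       send_to_sink(batch)

   with ThreadPoolExecutor() as pool:
       for morsel in data_source:  # morsels are consumed one at a time
           # parallelism is *within* the morsel, across its small batches
           list(pool.map(process_batch, l2_batches(morsel)))
   ```
   
   The difference is that each small batch is carried through the entire 
operator chain before the next batch is touched, so the working set stays 
roughly cache-sized instead of morsel-sized.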
   
   This is the model we are working toward in the current execution engine 
(the hash join is pretty close; the projection and filter nodes still need 
some work before they can handle small batches).

