alamb commented on pull request #8473:
URL: https://github.com/apache/arrow/pull/8473#issuecomment-710096379


   > A stream of record batches will only return the next batch after the previous one was (async) computed. Therefore, we can't parallelize, as we are blocked by the execution of the previous batch. Therefore, this PR is only useful for situations on which we are reading batches that we may need to wait for.
   
   @jorgecarleitao  -- I personally think leaving the interface to be 
`Stream<RecordBatch>` is the way to go. 
   
   Among other things, this interface allows "backpressure", so that we don't 
end up buffering intermediate results if one operator (e.g. filter) can produce 
record batches faster than its consumer (e.g. aggregate) can process them. 
   
   If we want to add additional parallelism, with a tunable cost of additional 
buffering, we could keep the interface of this PR and add an operator like 
"Buffer". Buffer would compute and store some number of `RecordBatch`es, 
perhaps in a channel, as separate tasks, in advance of a batch being 
requested. 
   
   Then the user of DataFusion could control how much more buffering they were 
willing to accept in order to get more parallelism. 
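   To make the "Buffer" idea concrete, here is a minimal sketch of such an operator. It uses only the standard library (`std::sync::mpsc` and a thread standing in for an async task), and `RecordBatch` is a hypothetical placeholder type rather than arrow's real one; a real implementation would spawn async tasks over a `Stream` instead:

   ```rust
   use std::sync::mpsc::sync_channel;
   use std::thread;

   // Hypothetical stand-in for arrow's RecordBatch.
   #[derive(Debug)]
   struct RecordBatch(usize);

   // A "Buffer" stage: a separate task (here, a thread) eagerly computes
   // batches ahead of the consumer. The bounded channel holds at most
   // `capacity` batches, so backpressure kicks in once the buffer is full --
   // this is the tunable parallelism/buffering trade-off.
   fn buffered<I>(input: I, capacity: usize) -> impl Iterator<Item = RecordBatch>
   where
       I: Iterator<Item = RecordBatch> + Send + 'static,
   {
       let (tx, rx) = sync_channel(capacity);
       thread::spawn(move || {
           for batch in input {
               // Blocks once `capacity` batches are already buffered.
               if tx.send(batch).is_err() {
                   break; // consumer hung up; stop producing
               }
           }
       });
       rx.into_iter()
   }

   fn main() {
       // Downstream consumer sees an ordinary iterator of batches, but up to
       // 2 batches are computed ahead of it on the producer task.
       let input = (0..5).map(RecordBatch);
       let total: usize = buffered(input, 2).map(|b| b.0).sum();
       println!("sum of batch ids: {}", total);
   }
   ```

   With `capacity = 0` this degenerates to the fully synchronous behavior of the current interface; larger capacities buy parallelism at the cost of buffered memory, which is exactly the knob the user would control.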
   
   I think an approach like https://github.com/apache/arrow/pull/8480 is cool, 
but the choice of the aggregate, for example, to compute its input as fast as 
possible in new tasks means buffering happens regardless of how quickly the 
aggregate can consume those batches, and the user has no way to bound it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
