[ 
https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307378#comment-17307378
 ] 

Weston Pace commented on ARROW-12030:
-------------------------------------

Let's assume there is a pipeline, A-B-C-D.  A is async I/O and B, C, and D are 
all sync compute operations.  By default, in the pull model, if you only add 
readahead between A and B then you will never have parallelism on B-C-D 
(fan-out or pipeline).  Your main thread will simply be calling Collect(D) 
which looks like...
{code:python}
# Collect(D): each await D() pulls from C, which pulls from B, which awaits A
results = []
while True:
  results.append(await D())
{code}
D(N) will always follow C(N) and D(N+1) will always follow D(N), because the 
loop does not ask for the next item until the current one has been returned.  
You will have a serial sequence of B(1), C(1), D(1), B(2), ..., D(N).
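
A minimal sketch of that ordering, assuming hypothetical pass-through stages 
(the names A, B, C, D and collect are illustrative only, not Arrow APIs).  A is 
an async stand-in for I/O and B, C, D are plain sync functions that record when 
they run:

```python
import asyncio

trace = []  # order in which the sync stages actually run


async def A(n):
    # Hypothetical async I/O stage: yield to the event loop, then return item n.
    await asyncio.sleep(0)
    return n


def make_stage(name):
    # Build a sync pass-through stage that records each call in `trace`.
    def stage(x):
        trace.append(f"{name}({x})")
        return x
    return stage


B, C, D = make_stage("B"), make_stage("C"), make_stage("D")


async def pull_D(n):
    # Pulling from D pulls from C, which pulls from B, which awaits A.
    return D(C(B(await A(n))))


async def collect(count):
    # The next item is not requested until the current one has been returned.
    results = []
    for n in range(1, count + 1):
        results.append(await pull_D(n))
    return results


results = asyncio.run(collect(3))
print(trace)
```

With no buffer between the stages, the trace comes out strictly interleaved per 
item, so B(2) cannot start before D(1) has finished.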

This may be ok or it may not be ok.  Maybe C is a task that supports natural 
fan-out.  Whether you do a map-reduce style fork-join or an async generator 
style parallel readahead (both achieve the same result), you are adding a 
buffer.
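
A rough sketch of the async generator flavor, assuming a hypothetical 
readahead_map helper and a thread pool standing in for the CPU executor (none 
of these names come from Arrow).  The deque of in-flight futures is the buffer, 
and max_readahead is the fixed constant this issue is about:

```python
import asyncio
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def C(x):
    # Stand-in for a sync compute stage that can run on several items at once.
    return x * x


async def readahead_map(items, fn, max_readahead, executor):
    # Keep up to `max_readahead` calls of `fn` in flight; yield in input order.
    loop = asyncio.get_running_loop()
    pending = deque()  # the buffer: futures for items that were read ahead
    for item in items:
        pending.append(loop.run_in_executor(executor, fn, item))
        if len(pending) >= max_readahead:
            yield await pending.popleft()
    while pending:
        # Input is exhausted; drain whatever is still in flight.
        yield await pending.popleft()


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        return [r async for r in readahead_map(range(1, 6), C, 3, pool)]


results = asyncio.run(main())
print(results)
```

The consumer still pulls one result at a time, but up to max_readahead calls of 
C can overlap, which is where the parallelism on C comes from.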

 

A push model doesn't actually need buffering or blocking.  ReactiveX works this 
way out of the box.  When an upstream task finishes it synchronously triggers 
the downstream task.  Some operators (e.g. combine) implicitly require 
buffering, in the same way some async generator operators (e.g. merge or 
sequence) do.  I don't know that I've mentally mapped all of these thoughts.  
However, a basic chain of mapped tasks wouldn't require it.  Then, just like in 
the pull model, you get to pick where to put the buffering/blocking.
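
A small sketch of such a chain of mapped tasks in push style, assuming 
hypothetical Stage and Sink classes (loosely modeled on observer callbacks, not 
the ReactiveX API).  Each stage calls its downstream directly as soon as it has 
a result, so nothing is queued between stages:

```python
trace = []    # order in which stages run
results = []  # what reaches the end of the chain


class Sink:
    # Terminal consumer: stores whatever is pushed into it.
    def on_next(self, item):
        results.append(item)


class Stage:
    # Hypothetical push-style map stage with no internal buffer.
    def __init__(self, name, fn, downstream):
        self.name = name
        self.fn = fn
        self.downstream = downstream

    def on_next(self, item):
        trace.append(self.name)
        # Finishing this task synchronously triggers the downstream task.
        self.downstream.on_next(self.fn(item))


# Build the chain back to front: B -> C -> D -> sink.
d = Stage("D", lambda x: x - 3, Sink())
c = Stage("C", lambda x: x * 2, d)
b = Stage("B", lambda x: x + 1, c)

# The source (A in the example pipeline) pushes each item as it arrives.
for item in (1, 2):
    b.on_next(item)

print(trace, results)
```

Parallelism or backpressure would then come from deliberately placing a queue 
or executor hand-off at one chosen point in the chain.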

> [C++] Change dataset readahead to be based on available RAM/CPU instead of 
> fixed constants/options
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in the dataset scanning there are a few places where we add 
> readahead.  At each spot we have to pick some max for how much we read ahead. 
>  Instead of trying to figure out some max it might be nicer to base it on the 
> available RAM.
> On the other hand, it may be the case that there is some set of nice 
> constants that just always works, so this can probably wait until we 
> better understand the memory usage of dataset scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
