[
https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307378#comment-17307378
]
Weston Pace commented on ARROW-12030:
-------------------------------------
Let's assume there is a pipeline, A-B-C-D. A is async I/O and B, C, and D are
all sync compute operations. By default, in the pull model, if you only add
readahead between A and B then you will never have parallelism on B-C-D
(fan-out or pipeline). Your main thread will simply be calling Collect(D)
which looks like...
{code:python}
# Pseudocode for Collect(D): every iteration pulls one item through the whole chain
while True:
    results.append(await D())
{code}
D(N) will always follow C(N) and D(N+1) will always follow D(N). You will have
a serial sequence of B(1),C(1),D(1),B(2)...,D(N).
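To make that ordering concrete, here is a minimal sketch of the pull model using Python async generators. It is not Arrow code: `run_pull_pipeline`, `stage`, and `mapped` are hypothetical names made up for illustration, and A is simulated with `asyncio.sleep(0)` rather than real I/O.

```python
import asyncio


def run_pull_pipeline(n_items):
    """Pull n_items through A-B-C-D and return the order in which stages ran."""
    trace = []

    async def a():
        # A: stand-in for async I/O (hypothetical source)
        for i in range(1, n_items + 1):
            await asyncio.sleep(0)
            yield i

    def stage(name):
        # B, C, D: synchronous compute steps that record when they run
        def apply(i):
            trace.append(f"{name}({i})")
            return i
        return apply

    async def mapped(source, fn):
        # Pull-style map: an item is only requested when the consumer asks
        async for item in source:
            yield fn(item)

    async def collect():
        d = mapped(mapped(mapped(a(), stage("B")), stage("C")), stage("D"))
        results = []
        async for item in d:
            results.append(item)
        return results

    asyncio.run(collect())
    return trace


print(run_pull_pipeline(2))
```

With two items the trace comes out as B(1), C(1), D(1), B(2), C(2), D(2): nothing downstream of A ever overlaps.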
This may be ok or it may not be ok. Maybe C is a task that supports natural
fan-out. Whether you do a map-reduce style fork-join or an async generator
style parallel readahead (both achieve the same result) you are adding a buffer.
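As a rough sketch of the "async generator style parallel readahead" option, the snippet below keeps a bounded number of calls to a blocking function in flight on the default thread pool. The names `readahead_map`, `numbers`, `c`, and `max_readahead` are assumptions for illustration only, not Arrow APIs. The deque of pending futures is exactly the buffer being added.

```python
import asyncio
from collections import deque


async def readahead_map(source, fn, max_readahead):
    """Apply blocking fn to items from an async iterator, keeping up to
    max_readahead calls in flight. Results are yielded in input order."""
    loop = asyncio.get_running_loop()
    pending = deque()  # this deque is the buffer the readahead introduces
    async for item in source:
        # Start the work right away on the default thread pool
        pending.append(loop.run_in_executor(None, fn, item))
        if len(pending) >= max_readahead:
            yield await pending.popleft()
    # Drain whatever is still in flight once the source is exhausted
    while pending:
        yield await pending.popleft()


async def numbers(n):
    # Hypothetical upstream stage
    for i in range(n):
        await asyncio.sleep(0)
        yield i


def c(x):
    # Hypothetical stand-in for the sync compute step C
    return x * x


async def demo():
    return [y async for y in readahead_map(numbers(5), c, max_readahead=3)]


def run_demo():
    return asyncio.run(demo())
```

A fork-join version would submit a batch of items and wait for all of them. The list holding that batch plays the same role as `pending` here.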
A push model doesn't actually need buffering or blocking. ReactiveX works this
way out of the box. When an upstream task finishes it synchronously triggers a
downstream task. Some operators (e.g. combine) implicitly require buffering in
the same way some async generator operators (e.g. merge or sequence) do. I don't
know that I've mentally mapped all these thoughts. However, a basic chain of
mapped tasks wouldn't require it. Then, just like in the pull model, you get
to pick where to put the buffering/blocking.
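For comparison, a bare-bones sketch of a push-style chain is below. This is not the ReactiveX API: `PushNode`, `on_next`, and `run_push_pipeline` are hypothetical names, and the loop feeding `b` stands in for A delivering completed reads. Each node calls its downstream synchronously as soon as it finishes, so the chain contains no queue.

```python
class PushNode:
    """A stage that runs fn and then synchronously pushes the result downstream."""

    def __init__(self, fn, downstream=None):
        self.fn = fn
        self.downstream = downstream

    def on_next(self, item):
        result = self.fn(item)
        if self.downstream is not None:
            # Finishing this task immediately triggers the next one
            self.downstream.on_next(result)


def run_push_pipeline(items):
    """Push items through B-C-D and return (stage order, collected results)."""
    trace, results = [], []

    def stage(name):
        # B, C, D: synchronous compute steps that record when they run
        def apply(i):
            trace.append(f"{name}({i})")
            return i
        return apply

    sink = PushNode(results.append)
    d = PushNode(stage("D"), sink)
    c = PushNode(stage("C"), d)
    b = PushNode(stage("B"), c)

    for item in items:  # stand-in for A pushing each completed read
        b.on_next(item)
    return trace, results


print(run_push_pipeline([1, 2]))
```

The stage order is the same serial B(1), C(1), D(1), B(2), ... sequence. Parallelism would only appear if a node handed work to an executor or a queue instead of calling `on_next` directly, and that is where the buffering decision would show up.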
> [C++] Change dataset readahead to be based on available RAM/CPU instead of
> fixed constants/options
> --------------------------------------------------------------------------------------------------
>
> Key: ARROW-12030
> URL: https://issues.apache.org/jira/browse/ARROW-12030
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
>
> Right now in the dataset scanning there are a few places where we add
> readahead. At each spot we have to pick some max for how much we read ahead.
> Instead of trying to figure out some max it might be nicer to base it on the
> available RAM.
> On the other hand, it may be the case that there is some set of nice
> constants that just always works so this can probably wait until we
> understand more the memory usage of dataset scanning.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)