[ https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305769#comment-17305769 ]

Weston Pace commented on ARROW-12030:
-------------------------------------

I think there are two readahead parameters, and they can be options: "how many 
files" to read at once and "how many blocks within a file" to read at once 
(fileReadahead and blockReadahead).
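
To make the two knobs concrete, here is an illustrative sketch (not Arrow's
actual API — the names `readahead`, `scan`, `file_readahead`, and
`block_readahead` are hypothetical) of nesting a file-level readahead around a
block-level readahead while preserving output order:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def readahead(items, fn, depth):
    """Apply fn to each item concurrently, keeping up to `depth` calls
    in flight, and yield the results in the original order."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(fn, item))
            if len(pending) >= depth:
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()

def scan(files, list_blocks, read_block, file_readahead=2, block_readahead=4):
    # Outer readahead across files, inner readahead across blocks in a file.
    for blocks in readahead(files, list_blocks, file_readahead):
        yield from readahead(blocks, read_block, block_readahead)
```

Because each level pops futures in submission order, results come out ordered
even when a later read finishes first.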


The other "readaheads" I was thinking about are less readahead and more queuing 
points.  For example, if you hit a slow file (maybe it's S3, or maybe some files 
are in the page cache and some aren't) then you can't proceed if you want to 
maintain order.  You can apply backpressure and halt the whole chain, or you can 
queue up results and allow the earlier stages to keep running.  The size of 
this queue is difficult to predict.  It depends on how much slower the slow 
file is and how much faster the downstream (decoding & concatenating) is.  For 
example, let's assume reading a file from the page cache takes 100ns, reading 
from disk takes 1000ns, and you can do all your downstream operations in 10ns.  
If one out of every 100 files is on disk then you need to buffer 9 or 10 
blocks, and the earliest of them will sit in the buffer for most of that 
1000ns read.
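
A quick back-of-the-envelope check of those (hypothetical) numbers:

```python
FAST_NS = 100    # read a file from page cache
SLOW_NS = 1000   # read a file from disk
DOWN_NS = 10     # downstream work (decode & concatenate) per block

# While the one slow read is outstanding, the fast path keeps delivering a
# block every 100ns, so roughly this many finished blocks pile up behind it:
blocks_buffered = SLOW_NS // FAST_NS - 1   # -> 9

# The earliest buffered block waits for nearly the whole gap:
max_wait_ns = SLOW_NS - FAST_NS            # -> 900

# Once the slow block arrives, downstream drains the backlog quickly:
drain_ns = blocks_buffered * DOWN_NS       # -> 90
```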


On the other hand, we could take the other extreme and let our queues be 
unlimited, but then, if we have a dataset where the downstream is slower than 
the upstream, that queue will grow without bound and it won't really be 
"streaming" anymore.


This is where a RAM limit involves less guesswork than trying to figure out 
the ideal limits for each of the potential queuing points.
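
As a sketch of what a RAM-based limit could look like (illustrative only, not
Arrow code — `ByteBudgetQueue` and its methods are hypothetical names): a queue
bounded by a byte budget rather than an element count, so backpressure kicks in
based on memory actually held, not on a guessed queue depth:

```python
import threading
from collections import deque

class ByteBudgetQueue:
    """An ordered queue bounded by a byte budget instead of an item count.
    Producers block once buffered blocks would exceed the budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.items = deque()
        self.cv = threading.Condition()

    def put(self, block, nbytes):
        with self.cv:
            # The `and self.items` guard lets a single oversized block
            # through so it cannot deadlock an empty queue.
            while self.used + nbytes > self.budget and self.items:
                self.cv.wait()  # backpressure: wait for downstream to drain
            self.items.append((block, nbytes))
            self.used += nbytes
            self.cv.notify_all()

    def get(self):
        with self.cv:
            while not self.items:
                self.cv.wait()
            block, nbytes = self.items.popleft()
            self.used -= nbytes
            self.cv.notify_all()
            return block
```

The same budget could be shared across several queuing points, which is the
appeal over per-queue constants.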


This concern is not limited to file ordering, and it isn't limited to the scan 
portion of the execution graph.  Maybe one column has a complex filter on it 
(a string column with a regex filter) or takes more time to encode/decode (a 
dictionary column where we have to hash the inputs).  Anywhere the processing 
time is variable, I think we may eventually want some buffering capability.  
For example, in Quickstep, there is buffering between every single node in 
the graph.

> Change dataset readahead to be based on available RAM/CPU instead of fixed 
> constants/options
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in the dataset scanning there are a few places where we add 
> readahead.  At each spot we have to pick some max for how much we read ahead. 
> Instead of trying to figure out some max it might be nicer to base it on the 
> available RAM.
> On the other hand, it may be the case that there is some set of nice 
> constants that just always works so this can probably wait until we 
> understand more the memory usage of dataset scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
