[ 
https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306188#comment-17306188
 ] 

Antoine Pitrou commented on ARROW-12030:
----------------------------------------

bq. Maybe one column has a complex filter on it (a string column with a regex 
filter) or takes more time encoding/decoding (a dictionary column where we have 
to hash inputs).  Anywhere the processing time is variable I think we may 
eventually want some buffering capability. 
I'm not sure why we would want that. If we're talking about CPU-bound 
operations then the important thing is that the CPU remains busy. A thread pool 
should ensure that naturally.
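
As a rough illustration of that point (a minimal sketch, not Arrow's actual thread pool or scanner code): a fixed set of workers pulls tasks of uneven cost from a shared counter, so a slow task on one worker does not leave the others idle while work remains. The names ProcessColumn and RunOnPool are hypothetical, and the busy loop only stands in for real per-column work such as a regex filter or dictionary hashing.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-column work whose cost varies
// (e.g. a regex filter versus a plain copy).
std::uint64_t ProcessColumn(std::uint64_t cost) {
  std::uint64_t acc = 0;
  for (std::uint64_t i = 0; i < cost; ++i) acc += i % 7;
  return acc;
}

// Runs every task on a fixed set of worker threads. Each worker claims the
// next unclaimed task as soon as it finishes its current one, so an
// expensive task never blocks the remaining tasks from being picked up.
std::vector<std::uint64_t> RunOnPool(const std::vector<std::uint64_t>& costs,
                                     unsigned num_threads) {
  std::vector<std::uint64_t> results(costs.size());
  std::atomic<std::size_t> next{0};

  auto worker = [&] {
    for (;;) {
      // Atomically claim the index of the next task.
      const std::size_t i = next.fetch_add(1);
      if (i >= costs.size()) return;
      // Each index is claimed by exactly one worker, so this write is
      // not shared with any other thread.
      results[i] = ProcessColumn(costs[i]);
    }
  };

  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  return results;
}
```

Calling RunOnPool with costs such as {1000, 10, 500000, 1, 20000} and three threads would let two workers drain the cheap tasks while the third is busy with the expensive one. Whether additional buffering helps beyond this is the open question in this thread.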


> Change dataset readahead to be based on available RAM/CPU instead of fixed 
> constants/options
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in dataset scanning there are a few places where we add 
> readahead.  At each spot we have to pick some max for how much we read ahead. 
> Instead of trying to figure out some max it might be nicer to base it on the 
> available RAM.
> On the other hand, it may be the case that there is some set of nice 
> constants that just always works, so this can probably wait until we 
> better understand the memory usage of dataset scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
