[ https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306188#comment-17306188 ]
Antoine Pitrou commented on ARROW-12030:
----------------------------------------

bq. Maybe one column has a complex filter on it (a string column with a regex filter) or takes more time encoding/decoding (a dictionary column where we have to hash inputs). Anywhere the processing time is variable I think we may eventually want some buffering capability.

bq. I'm not sure why we would want that. If we're talking about CPU-bound operations then the important thing is that the CPU remains busy. A thread pool should ensure that naturally.

> Change dataset readahead to be based on available RAM/CPU instead of fixed constants/options
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in the dataset scanning there are a few places where we add
> readahead. At each spot we have to pick some max for how much we read ahead.
> Instead of trying to figure out some max it might be nicer to base it on the
> available RAM.
> On the other hand, it may be the case that there is some set of nice
> constants that just always works so this can probably wait until we
> understand more the memory usage of dataset scanning.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)