[ https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305769#comment-17305769 ]
Weston Pace commented on ARROW-12030: -------------------------------------

I think there are two readahead parameters, and they can be options. They are "how many files" to read at once and "how many blocks within a file" (fileReadahead and blockReadahead).

The other "readaheads" I was thinking about are less readahead and more queuing points. For example, if you hit a slow file (maybe it's S3, or maybe some files are in the page cache and some aren't) then you can't proceed if you want to maintain order. You can apply backpressure and halt the whole chain, or you can queue up results and allow the earlier stages to keep running. The size of this queue is difficult to predict. It depends on how much slower the slow file is and how much faster the downstream (decoding & concatenating) is.

For example, let's assume reading a file from the page cache takes 100ns, reading from disk takes 1000ns, and all downstream operations take 10ns. If one out of every 100 files is on disk, then you need to buffer 9 or 10 blocks and you'll need to keep them in the buffer for about 100ns.

On the other hand, we could take the opposite extreme and let our queues be unlimited. But then, on a dataset where the downstream is slower than the upstream, that queue will grow without bound and it won't really be "streaming" anymore. This is where a RAM limit is less guesswork than trying to figure out the ideal limits for all of the potential queuing points.

This concern is not limited to file ordering, and it isn't limited to the scan portion of the execution graph. Maybe one column has a complex filter on it (a string column with a regex filter) or takes more time to encode/decode (a dictionary column where we have to hash inputs). Anywhere the processing time is variable, I think we may eventually want some buffering capability. For example, in Quickstep, there is buffering between every single node in the graph.
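To illustrate the idea of bounding a queuing point by RAM rather than by a tuned element count, here is a minimal sketch (not Arrow code; the class name and byte-limit policy are assumptions for illustration). Producers block, applying backpressure, once the buffered batches exceed a byte budget, but a single oversize batch is always admitted into an empty queue so the pipeline cannot deadlock:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: a queue bounded by total buffered bytes instead of
// a fixed element count.  The single knob (byte_limit) replaces per-stage
// readahead constants at each queuing point.
class MemoryBoundedQueue {
 public:
  explicit MemoryBoundedQueue(std::size_t byte_limit)
      : byte_limit_(byte_limit) {}

  // Blocks (backpressure) until the batch fits in the byte budget,
  // or the queue is empty (so an oversize batch can't deadlock).
  void Push(std::vector<char> batch) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] {
      return bytes_ + batch.size() <= byte_limit_ || queue_.empty();
    });
    bytes_ += batch.size();
    queue_.push_back(std::move(batch));
    not_empty_.notify_one();
  }

  // Blocks until a batch is available, then frees its bytes from the budget.
  std::vector<char> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty(); });
    std::vector<char> batch = std::move(queue_.front());
    queue_.pop_front();
    bytes_ -= batch.size();
    not_full_.notify_one();
    return batch;
  }

 private:
  std::size_t byte_limit_;
  std::size_t bytes_ = 0;
  std::deque<std::vector<char>> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
};
```

Under this policy, a stage absorbing one slow S3 file among fast page-cache files simply consumes more of the shared byte budget for a while, without anyone having guessed the right per-stage block count in advance.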
> Change dataset readahead to be based on available RAM/CPU instead of fixed
> constants/options
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in the dataset scanning there are a few places where we add
> readahead. At each spot we have to pick some max for how much we read ahead.
> Instead of trying to figure out some max it might be nicer to base it on the
> available RAM.
> On the other hand, it may be the case that there is some set of nice
> constants that just always works so this can probably wait until we
> understand more the memory usage of dataset scanning.

-- This message was sent by Atlassian Jira (v8.3.4#803005)