[ 
https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306729#comment-17306729
 ] 

Weston Pace commented on ARROW-12030:
-------------------------------------

> I'm not sure why we would want that. If we're talking about CPU-bound 
> operations then the important thing is that the CPU remains busy. A thread 
> pool should ensure that naturally.

I may not be describing this well.  The back-pressure is going to halt the I/O 
(and potentially other CPU operations).  It won't stall the bottleneck itself.  
For example, assume you are reading a 30GB Feather file from an SSD on a 
2-core, 8GB AWS instance.  Your goal is to filter on a string column using a 
regex or do some other CPU-intensive computation (unique, an expensive cast / 
decoding, or maybe many such operations).

Now let's pretend we can read at 1000MB/s but the two cores working together 
can only process this operation at 400MB/s.  Every second you are growing RAM 
by 600MB and in ~15 seconds you will have only processed part of the file but 
you will have run out of RAM.

On the other hand, since your I/O is so much faster than your CPU, you don't 
really need to read ahead all that much.  You can probably keep a 10MB buffer.  
The entire operation will still take 75 seconds (30GB at 400MB/s), but instead 
of taking over 8GB of RAM you only use 10MB.
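
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in 
Python.  All of the numbers are the hypothetical ones from this example (not 
measurements), the function names are made up for illustration, and it ignores 
any memory used by the computation itself.

```python
# Hypothetical figures from the example above, in MB and MB/s.
FILE_MB = 30_000     # 30GB Feather file
RAM_MB = 8_000       # 8GB instance
READ_MB_S = 1_000    # SSD read rate
CPU_MB_S = 400       # combined processing rate of the two cores


def unbounded_readahead():
    """Readahead with no limit: buffered data grows until RAM is exhausted."""
    growth_mb_s = READ_MB_S - CPU_MB_S           # data read but not yet processed
    seconds_to_oom = RAM_MB / growth_mb_s        # time until the buffer fills RAM
    processed_mb = CPU_MB_S * seconds_to_oom     # work finished by that point
    return growth_mb_s, seconds_to_oom, processed_mb


def bounded_readahead(buffer_mb=10):
    """Readahead capped at buffer_mb: the CPU sets the pace for the whole scan."""
    total_seconds = FILE_MB / CPU_MB_S           # the CPU is the bottleneck
    return total_seconds, buffer_mb              # peak buffered data is the cap


if __name__ == "__main__":
    growth, oom_s, done = unbounded_readahead()
    print(f"unbounded: +{growth}MB/s, out of RAM after {oom_s:.1f}s, "
          f"{done:.0f}MB of {FILE_MB}MB processed")
    total, peak = bounded_readahead()
    print(f"bounded: {total:.0f}s total, {peak}MB peak readahead")
```

The total time is the same either way because the CPU is the limiting factor; 
only the amount of buffered data changes.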

Now, you could argue that the I/O readahead in the scan node should be 
sufficient.  However, imagine the graph is Scan->Filter->HashAgg->Output and 
the slow node is HashAgg.  The Filter node will consume the I/O quickly, so 
the I/O readahead buffer won't fill up.  However, the Filter node will keep 
delivering batches to the HashAgg node, and they will pile up there.
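
As a toy illustration of what back-pressure between Filter and HashAgg could 
look like, here is a minimal sketch using Python's standard queue.Queue as a 
bounded hand-off.  This is not Arrow's API or how the C++ nodes are 
implemented; run_pipeline, max_queued and the sleep-based "slow" node are all 
assumptions made up for the example.

```python
import queue
import threading
import time


def run_pipeline(num_batches=50, max_queued=4, slow_seconds=0.001):
    """Fast producer (Filter) feeding a slow consumer (HashAgg) via a bounded queue."""
    q = queue.Queue(maxsize=max_queued)  # put() blocks once max_queued items wait
    max_depth = 0
    consumed = []

    def filter_node():
        nonlocal max_depth
        for batch in range(num_batches):
            q.put(batch)                 # blocks here while HashAgg is behind
            max_depth = max(max_depth, q.qsize())
        q.put(None)                      # sentinel: no more batches

    def hash_agg_node():
        while True:
            batch = q.get()
            if batch is None:
                break
            time.sleep(slow_seconds)     # stand-in for the expensive aggregation
            consumed.append(batch)

    threads = [threading.Thread(target=filter_node),
               threading.Thread(target=hash_agg_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(consumed), max_depth


if __name__ == "__main__":
    done, depth = run_pipeline()
    print(f"processed {done} batches, at most {depth} queued at once")
```

With an unbounded queue (maxsize=0) the Filter thread would never block and 
the backlog in front of HashAgg could grow to nearly the whole input, which is 
the situation described above.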

> [C++] Change dataset readahead to be based on available RAM/CPU instead of 
> fixed constants/options
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now in the dataset scanning there are a few places where we add 
> readahead.  At each spot we have to pick some max for how much we read ahead. 
>  Instead of trying to figure out some max it might be nicer to base it on the 
> available RAM.
> On the other hand, it may be the case that there is some set of nice 
> constants that just always works, so this can probably wait until we 
> better understand the memory usage of dataset scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
