westonpace opened a new pull request #12228:
URL: https://github.com/apache/arrow/pull/12228


   This PR changes a few things.
   
    * The default file readahead is changed to 4.  This doesn't seem to affect 
performance on HDD/SSD, and users should already be doing special tuning for S3. 
 Besides, in many cases users are reading IPC/Parquet files that have many row 
groups, so we already have sufficient I/O parallelism.  This is important 
for bringing down the overall memory usage, as can be seen in the formula below.
    * The default batch readahead is changed to 4.  Previously, when we were 
doing filtering and projection within the scanner, it made sense to read many 
batches ahead (generally you want at least 2 * # of CPUs in that case).  Now that 
the exec plan is doing the computation, the exec plan's buffering is instead 
handled by kDefaultBackpressureLow and kDefaultBackpressureHigh.
    * Moves the Parquet readahead around a bit.  The previous version would 
read ahead N row groups.  Now we always read ahead exactly 1 row group, but we 
read ahead N batches (this may mean that we read ahead more than 1 row group if 
the batch size is much larger than the row group size).
    * Changes the default scanner batch size to 64K rows.  Now that we have more or 
less decoupled the scanning batch size from the row group size, we can pass 
smaller batches through the scanner.  This makes it easier to get parallelism 
on small datasets and, more importantly, it means we hit the exec plan 
backpressure limit more quickly (see the configuration sketch after this list).
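
   To make the new knobs concrete, here is a minimal sketch of what these defaults look like when set explicitly.  It assumes the `batch_size`, `batch_readahead`, and `fragment_readahead` fields on `arrow::dataset::ScanOptions` keep these names; how the options get attached to a scanner or exec plan is omitted.

```cpp
#include <arrow/dataset/scanner.h>

#include <memory>

// Minimal sketch: the new defaults expressed as explicit ScanOptions settings.
// (Field names as on arrow::dataset::ScanOptions; wiring the options into a
// ScannerBuilder / exec plan is left out of this sketch.)
std::shared_ptr<arrow::dataset::ScanOptions> MakeDefaultScanOptions() {
  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  options->batch_size = 64 * 1024;   // new default scanner batch size (rows)
  options->batch_readahead = 4;      // new default batch readahead
  options->fragment_readahead = 4;   // new default file/fragment readahead
  return options;
}
```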
   
   Putting this all together, the scanner should now buffer in memory:
   
   MAX(fragment_readahead * row_group_size_bytes * 2, fragment_readahead * 
batch_readahead * batch_size_bytes)
   
   The exec plan should buffer ~ max_backpressure * batch_size_bytes
   
   Adding those two together should give the total RAM usage of a plan without 
pipeline breakers (note that the dataset write is effectively a pipeline breaker, 
but it does its own backpressure / spillover).
   
   So, given the Parquet dataset mentioned in the JIRA (21 files, 10 million 
rows each, 10 row groups each), and knowing that 1 row group is ~140MB when 
decompressed into Arrow format, we should get the following default memory usage:
   
   Scanner readahead = MAX(4 * 140MB * 2, 4 * 4 * 17.5MB) = MAX(1120MB, 280MB) 
= 1120MB
   Plan readahead = 64 * 17.5MB = 1120MB
   Total RAM usage should then be about 2240MB.
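
   As a sanity check, here is a small standalone sketch that plugs those numbers into the formula above.  The ~17.5MB batch size (one eighth of a row group) and the `max_backpressure = 64` limit are assumptions inferred from the arithmetic above, not values read out of the code.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  constexpr int64_t kMB = 1024 * 1024;
  // Rough figures for the JIRA dataset and the new defaults (assumptions only).
  int64_t fragment_readahead = 4;
  int64_t batch_readahead = 4;
  int64_t row_group_size_bytes = 140 * kMB;             // ~140MB decompressed
  int64_t batch_size_bytes = row_group_size_bytes / 8;  // ~17.5MB per batch
  int64_t max_backpressure = 64;                        // assumed exec plan limit

  // Scanner readahead = MAX(fragment * row_group * 2, fragment * batch * batch_bytes)
  int64_t scanner = std::max(fragment_readahead * row_group_size_bytes * 2,
                             fragment_readahead * batch_readahead * batch_size_bytes);
  // Exec plan buffering ~ max_backpressure * batch_size_bytes
  int64_t plan = max_backpressure * batch_size_bytes;
  std::cout << "scanner ~" << scanner / kMB << "MB, plan ~" << plan / kMB
            << "MB, total ~" << (scanner + plan) / kMB << "MB\n";
  return 0;
}
```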
   
   I've tested with the above dataset and an intentionally slow consumer and 
found peak RSS to be 2,220,496KB.  The RSS is a little sawtoothy as each row 
group gets slowly eroded until the next row group needs to be read in.
   
    - [ ] Add tests to verify memory usage
    - [ ] Update docs to mention that S3 users may want to increase the 
fragment readahead, but this will come at the cost of more RAM usage.
    - [ ] Update docs to give some of this "expected memory usage" information

