[
https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625079#comment-17625079
]
Dewey Dunnington commented on ARROW-17541:
------------------------------------------
This may or may not be related, but we have a report of "leaked memory" from a
dataset collect here:
https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak
> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
> Key: ARROW-17541
> URL: https://issues.apache.org/jira/browse/ARROW-17541
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: Carl Boettiger
> Priority: Critical
> Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker ·
> Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB
> parquet file) and streams it to disk:
>
> {code:java}
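> # Connect anonymously to the bucket on the S3-compatible endpoint
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org",
>                        anonymous = TRUE)
> # Open the remote dataset lazily, then stream it to a temporary local path
> df <- arrow::open_dataset(s3$path("waq_test"))
> arrow::write_dataset(df, tempfile())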
> {code}
> In 8.0.0, this operation peaks at about 10 GB of RAM, which is already
> surprisingly high (the whole file is only 4 GB on disk), but on arrow
> 9.0.0 RAM use for the same operation approximately doubles, which is large
> enough to trigger the OOM killer on the task in several of our active
> production workflows.
>
> Can this large increase in RAM use introduced in 9.0 be avoided? Is it
> possible for this operation to use even less RAM than it does in the 8.0
> release? Is there something about this particular parquet file that could be
> responsible for the large RAM use?
>
> Arrow's impressively fast performance on large data on remote hosts is really
> game-changing for us. Still, the OOM errors are a bit unexpected at this
> scale (i.e. a single 4 GB parquet file). As R users, we really depend on
> arrow's out-of-core operations to work with larger-than-RAM data.
>