Carl Boettiger created ARROW-17541:
--------------------------------------
Summary: [R] Substantial RAM use increase in 9.0.0 release on
write_dataset()
Key: ARROW-17541
URL: https://issues.apache.org/jira/browse/ARROW-17541
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 9.0.0
Reporter: Carl Boettiger
Consider the following reprex, which opens a remote dataset (a single 4 GB
parquet file) and streams it to disk:
s3 <- arrow::s3_bucket("data",
                       endpoint_override = "minio3.ecoforecast.org",
                       anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))
arrow::write_dataset(df, tempfile())
In 8.0.0, this operation peaks at roughly 10 GB of RAM, which is already
surprisingly high given that the file is only 4 GB on disk. On arrow 9.0.0,
RAM use for the same operation approximately doubles, which is enough to
trigger the OOM killer in several of our active production workflows.
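To compare the 8.0.0 and 9.0.0 peaks on the same workload, one rough option
on Linux is to read VmHWM (the process's peak resident set size) from
/proc/self/status after the write completes. A sketch, assuming the reprex
above has already created df:

# Linux-only sketch: VmHWM reports the peak resident set size of this process.
peak_rss_gb <- function() {
  status <- readLines("/proc/self/status")
  hwm <- status[startsWith(status, "VmHWM:")]
  kb <- as.numeric(gsub("[^0-9]", "", hwm))
  kb / 1024^2
}

arrow::write_dataset(df, tempfile())
peak_rss_gb()  # compare this value under arrow 8.0.0 vs 9.0.0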
Can the large RAM use increase introduced in 9.0.0 be avoided? Is it possible
for this operation to use even less RAM than it does in the 8.0.0 release? Is
there something about this particular parquet file that could be responsible
for the large RAM use?
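One workaround worth testing (a sketch, not a confirmed fix) is to cap how
many rows write_dataset() buffers per row group before flushing, via its
min_rows_per_group / max_rows_per_group arguments. The values below are
illustrative guesses, and whether this actually lowers the 9.0.0 peak is an
assumption:

library(arrow)

s3 <- s3_bucket("data",
                endpoint_override = "minio3.ecoforecast.org",
                anonymous = TRUE)
df <- open_dataset(s3$path("waq_test"))

# Sketch: smaller row groups should mean less data held in memory before
# each flush to disk; these sizes are illustrative, not tuned values.
write_dataset(df, tempfile(),
              min_rows_per_group = 50000,
              max_rows_per_group = 100000)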
Arrow's impressively fast performance on large data on remote hosts is really
game-changing for us, but OOM errors are unexpected at this scale (a single
4 GB parquet file). As R users, we really depend on arrow's out-of-core
operations to work with larger-than-RAM data.