[
https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496426#comment-17496426
]
Will Jones commented on ARROW-15730:
------------------------------------
{quote}Did the memory model (i.e. keeping a copy within arrow) change after
0.15 or was it introduced afterwards? That was the previous version I was
using and I never had these kinds of memory "issues" with it (understood that
issues is not necessarily the right word). The double counting just seems very
punitive.
{quote}
No, as far as I know it should have gotten better since then, with fewer copies
made. This has been done by implementing "altrep" conversions to R vectors,
which let R use the existing Arrow array memory instead of copying the data.
With each new version we've implemented this for additional data types. For
example, here's the PR for integer and numeric vectors:
[https://github.com/apache/arrow/pull/10445].
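If you want to check whether a particular column actually came back as an altrep
vector rather than a copy, base R can show you. A rough sketch (assuming the file
is in the IPC/Feather format that read_feather() reads; the exact inspect output
differs between R versions):
{code:r}
library(arrow)

# Convert to a data.frame; with altrep, integer/double (and in 6.0.1+, string)
# columns wrap the Arrow buffers instead of materializing a copy.
df <- read_feather("file.arrow5")

# .Internal(inspect()) prints the internal representation of an object;
# an altrep-backed column shows an altrep class in its header rather than
# an ordinary materialized vector.
.Internal(inspect(df[[1]]))
{code}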
However, I just noticed that altrep was only implemented for ChunkedArray in
7.0.0, so you might not be getting the full benefit in 6.0.1 (since your 30GB
file is most likely made up of multiple chunks). It is likely worth retrying
in 7.0.0.
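To see whether chunking is actually in play, you can read the file back as an
Arrow Table (no conversion to R vectors) and count the chunks per column.
Another rough sketch under the same file-format assumption:
{code:r}
library(arrow)

# as_data_frame = FALSE returns an Arrow Table without converting to R vectors
tbl <- read_feather("file.arrow5", as_data_frame = FALSE)

# Each Table column is a ChunkedArray; more than one chunk means the
# ChunkedArray altrep added in 7.0.0 is what 6.0.1 is missing.
tbl[[1]]$num_chunks
{code}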
{quote}I just tried "system" and it does free it up (as you said) but for a
while R is using about 70gb when the actual object size within R is just 30gb.
{quote}
It's hard to do any computation (read, aggregate, write, whatever) without
creating some sort of intermediate result, so for a 30GB file that sounds pretty
normal. Are you saying you measured lower peak memory use in Arrow 0.15?
{quote}Do you know if R Factors (show up as dictionary in the table schema) are
especially punitive compared to strings?
{quote}
It's hard to say, and I think it depends on your arrow package version: altrep
was implemented for strings in 6.0.1, but it won't be implemented for factors
until 8.0.0. The best thing to do would be to save one file with just a string
column and one with just a factor column, and then compare the peak versus
result memory for each.
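Something along these lines would do it (the sizes and column names are just
illustrative); read each file in a fresh session and compare peak process memory
from Task Manager / top against object.size() of the result:
{code:r}
library(arrow)

n <- 1e7
chr <- sample(letters, n, replace = TRUE)

# One file with a character column, one with the same values as a factor
write_feather(data.frame(x = chr, stringsAsFactors = FALSE), "strings.arrow")
write_feather(data.frame(x = factor(chr)), "factors.arrow")

# Then, in a fresh R session for each file separately:
df <- read_feather("strings.arrow")   # or "factors.arrow"
object.size(df)                       # size of the converted R object
{code}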
{quote}Were you able to set Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system")
within Rstudio? I tried a few different ways and it always just shows me
mimalloc. It does work in an R console window (which is where I did the above
test).
I even completely restarted Rstudio - whatever I do it stays at mimalloc.
{quote}
Yes, I tested in RStudio. Make sure to do Session > Restart R before you do this.
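One thing that trips people up: I believe the variable is read when arrow
initializes, so it has to be set before the arrow package is loaded in that
session. A minimal sketch (assuming your arrow version exposes
default_memory_pool()$backend_name, which 6.x should):
{code:r}
# Run this first in a freshly restarted R session, before arrow is loaded
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
library(arrow)

# Confirm which allocator is actually in use; this should print "system"
default_memory_pool()$backend_name
{code}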
> [R] Memory usage in R blows up
> ------------------------------
>
> Key: ARROW-15730
> URL: https://issues.apache.org/jira/browse/ARROW-15730
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Christian
> Assignee: Will Jones
> Priority: Major
> Fix For: 6.0.1
>
> Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10gb arrow file into R (under Windows)
> _(The file is generated in the 6.0.1 arrow version under Linux)._
> For whatever reason the memory usage blows up to ~110-120gb (in a fresh and
> empty R instance).
> The weird thing is that when deleting the object again and running gc(), the
> memory usage only goes down to 90gb. The delta of ~20-30gb is what I would
> have expected the dataframe to use up in memory (that's also approximately
> what was used, in total during the load, with the old arrow version 0.15.1,
> and it is also what R shows me when just printing the object size).
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen on Linux: there, all the resources are freed
> up when calling gc(). Not sure if it matters, but on Linux I also don't need
> to set the cpu count to 1.
> Any help would be appreciated.