[
https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496143#comment-17496143
]
Christian edited comment on ARROW-15730 at 2/22/22, 3:02 PM:
-------------------------------------------------------------
I did some more testing (all the reading is done within R with Arrow 6.0.1). It
looks like there are a few things here:
1) I read a file that was written in Arrow 5 (the file is ~30 GB and was
written directly through the C#/C++ interface). That one increases the memory
usage to ~30-38 GB, but on gc() the memory usage only goes down to ~8 GB and
doesn't free everything up. I'm not sure why that is, but that's acceptable. The
file only has chr/Date/num/int columns. Calling arrow_info() yields the
following (same result after loading/deleting the data frame):
Allocator mimalloc
Current 0 bytes
Max 0 bytes
2) Reading the file from last week (~10 GB, written with Arrow 6.0.1 from R)
again yields the same result as last week. Note that here I also have the
factor/logical types, which Arrow seems to store and read.
Allocator mimalloc
Current 4.19 Kb
Max 34.31 Gb
3) As a test I did a write_arrow() on the file from 2), but I ran unfactor() on
all the factor columns first. Same issue as in 2), so it doesn't look like the
factor type is the issue.
4) As a final test I read the file from 2) and did a write_arrow() on it from
R. The issue comes up again.
Before deletion:
Allocator mimalloc
Current 28.2 Gb
Max 28.2 Gb
After deletion:
Allocator mimalloc
Current 0 bytes
Max 28.2 Gb
###
So the issue seems to be with writing the Arrow file from R. All I do is call
write_arrow('file.arrow5'). Is there a problem with that?
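For reference, the whole round trip is just the following (a minimal sketch; the file names are placeholders, and note that read_arrow()/write_arrow() were deprecated in arrow 5.0 in favor of the read_feather()/write_feather() IPC functions, so comparing against those may be informative):

```r
library(arrow)

# Minimal reproduction sketch (arrow 6.0.1, R). read_arrow()/write_arrow()
# are the calls from the report; they are deprecated aliases for the
# Feather/IPC readers and writers.
df <- read_arrow("file.arrow5")        # file originally written via C#/C++
write_arrow(df, "file_from_r.arrow5")  # re-written from R

df2 <- read_arrow("file_from_r.arrow5")
rm(df2)
gc()

# Allocator stats: for the R-written file, "Current" drops back near zero
# after deletion, but "Max" stays pinned at roughly the full data size.
arrow_info()
```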
> [R] Memory usage in R blows up
> ------------------------------
>
> Key: ARROW-15730
> URL: https://issues.apache.org/jira/browse/ARROW-15730
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Christian
> Priority: Major
> Fix For: 6.0.1
>
> Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10 GB Arrow file into R (under Windows).
> _(The file was generated with Arrow 6.0.1 under Linux.)_
> For whatever reason the memory usage blows up to ~110-120 GB (in a fresh and
> empty R instance).
> The weird thing is that after deleting the object and running gc(), the
> memory usage only goes down to 90 GB. The delta of ~20-30 GB is what I would
> have expected the data frame to use in memory (that's also approximately what
> was used, in total during the load, with the old Arrow version 0.15.1, and it
> is what R shows me when just printing the object size).
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen on Linux: there, all the resources are freed
> up when calling gc(). Not sure if it matters, but on Linux I also don't need
> to set the CPU count to 1.
> Any help would be appreciated.
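One way to narrow down whether Arrow itself is still holding the memory (as opposed to the allocator or OS simply not returning freed pages) is to compare Arrow's allocator stats with R's own accounting after gc(). A sketch of that check, using only the commands already in the report plus arrow_info() and base R's object.size() (file name is a placeholder):

```r
library(arrow)

options(arrow.use_threads = FALSE)
arrow::set_cpu_count(1)  # workaround from the report for the Windows freeze

df <- arrow::read_arrow("file.arrow5")
print(object.size(df), units = "Gb")  # what R thinks the data frame costs

rm(df)
invisible(gc())

# If "Current" here is near zero while the OS still reports ~90 GB resident,
# the memory is retained by the allocator/OS rather than by live Arrow buffers.
arrow_info()
```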
--
This message was sent by Atlassian Jira
(v8.20.1#820001)