[
https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496143#comment-17496143
]
Christian edited comment on ARROW-15730 at 2/22/22, 3:02 PM:
-------------------------------------------------------------
I did some more testing (all the reading is done within R with Arrow 6.0.1). It
looks like there are a few things here:
1) I read a file that was written in Arrow 5 (the file is ~30 GB and was
written directly through the C#/C++ interface). That one increases the memory
usage to ~30-38 GB, but on gc() the memory usage only goes down to ~8 GB and
doesn't free everything up. I'm not sure why that is, but that's acceptable. The
file only has chr/Date/num/int columns. Calling arrow_info() yields the
following (same result after loading/deleting the data frame):
Allocator mimalloc
Current 0 bytes
Max 0 bytes
2) Reading the file from last week (~10 GB, written with Arrow 6.0.1 from R)
again yields the same result as last week. Note that here I also have the
factor/logical types, which Arrow seems to store and read.
Allocator mimalloc
Current 4.19 Kb
Max 34.31 Gb
3) As a test I did a write_arrow() on the file from 2), but I ran unfactor() on
all the factor columns first. Same issue as in 2), so it doesn't look like the
factor type is the issue.
4) As a final test I read the file from 2) and did a write_arrow() on it from
R. The issue comes up again.
Before deletion:
Allocator mimalloc
Current 28.2 Gb
Max 28.2 Gb
After deletion:
Allocator mimalloc
Current 0 bytes
Max 28.2 Gb
###
So the issue seems to be with writing the Arrow file from R. All I do is call
write_arrow('file.arrow5'). Is there a problem with that?
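For reference, the whole round trip is just the following (a minimal sketch; the file names are placeholders, and note that read_arrow()/write_arrow() were deprecated in arrow 5.0 in favor of the read_feather()/write_feather() IPC functions, so comparing against those may be informative):

```r
library(arrow)

# Minimal reproduction sketch (arrow 6.0.1, R). read_arrow()/write_arrow()
# are the calls from the report; they are deprecated aliases for the
# Feather/IPC readers and writers.
df <- read_arrow("file.arrow5")        # file originally written via C#/C++
write_arrow(df, "file_from_r.arrow5")  # re-written from R

df2 <- read_arrow("file_from_r.arrow5")
rm(df2)
gc()

# Allocator stats: for the R-written file, "Current" drops back near zero
# after deletion, but "Max" stays pinned at roughly the full data size.
arrow_info()
```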
> [R] Memory usage in R blows up
> ------------------------------
>
> Key: ARROW-15730
> URL: https://issues.apache.org/jira/browse/ARROW-15730
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Christian
> Priority: Major
> Fix For: 6.0.1
>
> Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10 GB Arrow file into R (under Windows).
> _(The file was generated with Arrow 6.0.1 under Linux.)_
> For whatever reason the memory usage blows up to ~110-120 GB (in a fresh and
> empty R instance).
> The weird thing is that after deleting the object and running gc(), the
> memory usage only goes down to 90 GB. The delta of ~20-30 GB is what I would
> have expected the data frame to use in memory (that's also approximately what
> was used, in total during the load, with the old Arrow version 0.15.1, and it
> is what R shows me when just printing the object size).
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen on Linux: there, all the resources are freed
> up when calling gc(). Not sure if it matters, but on Linux I also don't need
> to set the CPU count to 1.
> Any help would be appreciated.
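One way to narrow down whether Arrow itself is still holding the memory (as opposed to the allocator or OS simply not returning freed pages) is to compare Arrow's allocator stats with R's own accounting after gc(). A sketch of that check, using only the commands already in the report plus arrow_info() and base R's object.size() (file name is a placeholder):

```r
library(arrow)

options(arrow.use_threads = FALSE)
arrow::set_cpu_count(1)  # workaround from the report for the Windows freeze

df <- arrow::read_arrow("file.arrow5")
print(object.size(df), units = "Gb")  # what R thinks the data frame costs

rm(df)
invisible(gc())

# If "Current" here is near zero while the OS still reports ~90 GB resident,
# the memory is retained by the allocator/OS rather than by live Arrow buffers.
arrow_info()
```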
--
This message was sent by Atlassian Jira
(v8.20.1#820001)