[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496143#comment-17496143 ]

Christian edited comment on ARROW-15730 at 2/22/22, 3:03 PM:
-------------------------------------------------------------

I did some more testing (all the reading is done from R with Arrow 6.0.1). It 
looks like there are a few things going on here:

1) I read a file that was written with Arrow 5 (the file is ~30 GB and was 
written directly through the C#/C++ interface). That one increases memory usage 
to ~30-38 GB, but on gc() it only goes down to 8 GB, so not everything is 
freed. I'm not sure why that is, but it's acceptable. The file only has 
chr/Date/num/int columns. Calling arrow_info() yields the following (same 
result after loading and after deleting the data frame; a rough sketch of the 
test sequence follows the numbers below).

Allocator mimalloc
Current    0 bytes
Max        0 bytes
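
For reference, the sequence for 1) is roughly the following (a minimal sketch; 
the path is a placeholder, and the later tests follow the same pattern):

  library(arrow)

  # read the Arrow 5 file that was written from C#/C++ (placeholder path)
  df <- read_arrow("file_from_cpp.arrow5")
  print(object.size(df), units = "GB")  # in-memory size of the data frame

  rm(df)
  gc()          # drop the R object and collect

  arrow_info()  # prints the allocator name plus Current/Max bytes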

2) Reading the file from last week (~10 GB, written with Arrow 6.0.1 from R) 
again yields the same result as last week. Note that this file also has the 
factor/logical types, which arrow seems to store and read.

Allocator mimalloc
Current    4.19 Kb
Max       34.31 Gb

3) As a test I did a write_arrow() of the data from 2), but with an unfactor 
applied to all the factor columns first (sketch below). Same issue as in 2), so 
it doesn't look like the factor type is the problem.
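
What I mean by the unfactor step, roughly (a sketch with placeholder paths; the 
factor columns are simply converted back to character before re-writing):

  df <- read_arrow("file_from_r.arrow5")          # the file from 2)
  is_fct <- vapply(df, is.factor, logical(1))
  df[is_fct] <- lapply(df[is_fct], as.character)  # drop the factor type
  write_arrow(df, "file_unfactored.arrow5")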

4) As a final test I read the file from 1) and did a write_arrow() on it from 
R. The issue comes up again after reading it back in (a sketch of the round 
trip follows the numbers below).

Before deletion:

Allocator mimalloc
Current    28.2 Gb
Max        28.2 Gb

After deletion:

Allocator mimalloc
Current    0 bytes
Max        28.2 Gb
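
The round trip in 4), roughly (a sketch; paths are placeholders):

  df <- read_arrow("file_from_cpp.arrow5")     # the file from 1)
  write_arrow(df, "file_rewritten.arrow5")     # re-write it from R
  rm(df); gc()

  df2 <- read_arrow("file_rewritten.arrow5")   # reading it back shows the issue
  rm(df2); gc()
  arrow_info()                                 # Current drops to 0, Max stays at ~28 GB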

 

###

So the issue seems to be with writing the arrow file from R. All I do is call 
write_arrow(df, 'file.arrow5'). Is there a problem with that?
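
Side note: I don't know whether it's relevant here, but the default allocator 
can apparently be switched via the ARROW_DEFAULT_MEMORY_POOL environment 
variable, as long as it is set before the package is loaded (a sketch, not a 
verified fix):

  Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")  # must happen before library(arrow)
  library(arrow)
  arrow_info()  # should now report the system allocator instead of mimalloc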


> [R] Memory usage in R blows up
> ------------------------------
>
>                 Key: ARROW-15730
>                 URL: https://issues.apache.org/jira/browse/ARROW-15730
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Christian
>            Priority: Major
>             Fix For: 6.0.1
>
>         Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10 GB arrow file into R (under Windows).
> _(The file is generated with arrow 6.0.1 under Linux.)_
> For whatever reason the memory usage blows up to ~110-120 GB (in a fresh and 
> empty R instance).
> The weird thing is that when I delete the object again and run gc(), the 
> memory usage only goes down to 90 GB. The delta of ~20-30 GB is what I would 
> have expected the data frame to take up in memory (that's also roughly what 
> was used, in total during the load, with the old arrow version 0.15.1, and it 
> is what R reports when printing the object size).
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up 
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen on Linux: there, all the resources are freed 
> when gc() is called. Not sure if it matters, but on Linux I also don't need 
> to set the CPU count to 1.
> Any help would be appreciated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
