[
https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494978#comment-17494978
]
Christian edited comment on ARROW-15730 at 2/19/22, 2:16 PM:
-------------------------------------------------------------
Apologies for the late reply. I just checked after a full computer restart and
it is exactly the same problem. Interestingly, this time the full memory usage
went to 90 GB, and after deleting the object and calling gc() it got stuck at
60 GB. So the same problem, just with slightly lower totals. It happens both in
RStudio and in an R terminal.
Below are the requested outputs. I also added what gc() reports and what
Windows shows as resource usage.
This happens with as_data_frame=TRUE (the default for read_arrow), since I
don't need to make any changes to the data frame when loading it.
And to reiterate: under Linux, all resources are freed after calling gc().
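For reference, the full sequence I'm running (the default_memory_pool() check
at the end is a sketch for inspecting what the Arrow allocator still holds; I'm
assuming the package exposes the live byte count that way):

```r
options(arrow.use_threads = FALSE)
arrow::set_cpu_count(1)                 # otherwise it freezes under Windows
df <- arrow::read_arrow("file.arrow5")  # as_data_frame = TRUE by default
rm(df)
gc()
# Sketch: inspect how many bytes the Arrow allocator (mimalloc here) still
# reports as allocated after the R-level garbage collection
arrow::default_memory_pool()$bytes_allocated
```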
> arrow::arrow_info()
Arrow package version: 6.0.1
Capabilities:
dataset TRUE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli FALSE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc TRUE
Arrow options():
arrow.use_threads FALSE
Memory:
Allocator mimalloc
Current 0 bytes
Max 34.31 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 6.0.1
C++ Compiler GNU
C++ Compiler Version 8.3.0
Git ID d132a740e33ec18c07b8718e15f85b4080a292ff
> gc()
          used  (Mb)  gc trigger     (Mb)   max used     (Mb)
Ncells 1792749  95.8     3428368    183.1    2914702    155.7
Vcells 4673226  35.7  2939373019  22425.7 3943230076  30084.5
> ls()
character(0)
!image-2022-02-19-09-05-32-278.png!
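One more thing I can try (a sketch, assuming the ARROW_DEFAULT_MEMORY_POOL
environment variable is honored when the package loads): switching from
mimalloc, which arrow_info() shows as the active allocator, to the system
allocator, to see whether memory is then returned to Windows after gc():

```r
# Sketch: must run before the arrow package is loaded in the session
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
library(arrow)
arrow_info()  # the Memory section should now list the system allocator
```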
> [R] Memory usage in R blows up
> ------------------------------
>
> Key: ARROW-15730
> URL: https://issues.apache.org/jira/browse/ARROW-15730
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Christian
> Priority: Major
> Fix For: 6.0.1
>
> Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10 GB Arrow file into R (under Windows).
> _(The file is generated with Arrow 6.0.1 under Linux.)_
> For whatever reason the memory usage blows up to ~110-120 GB (in a fresh and
> empty R instance).
> The weird thing is that after deleting the object again and running gc(), the
> memory usage only goes down to 90 GB. The delta of ~20-30 GB is what I would
> have expected the data frame to use in memory (and that is also approximately
> what was used, in total, during the load with the old Arrow version 0.15.1;
> it is also what R reports when printing the object size.)
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under Windows
> arrow::read_arrow('file.arrow5')
> Is Arrow reserving some resources in the background and not giving them up
> again? Are there settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen on Linux: there, all resources are freed
> when calling gc(). Not sure if it matters, but on Linux I also don't need
> to set the CPU count to 1.
> Any help would be appreciated.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)