[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496338#comment-17496338 ]
Jameel Alsalam commented on ARROW-15730:
----------------------------------------
Trying again with html output to see if it renders better:
Hello, I think I have reproduced the issue here. About 1.5 GB appears to still
be in use after the remove statement. I am on CRAN arrow 7.0.0. I was
interested in this issue because I have tried to diagnose a different arrow
memory issue involving write_dataset. In my investigations, the memory reported
internally by gc() or arrow is quite different from what Windows reports
via, e.g., Task Manager. I have found a way to get a Task Manager-like memory
figure by running `system2("tasklist", stdout=TRUE)` and then filtering for
the right process. Below is your script, run with the additional memory
info.
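For anyone who wants the process memory as a number rather than a raw `tasklist` line, here is a minimal sketch of the same idea. It assumes Windows and relies on the `tasklist` switches `/FI` (filter), `/FO CSV` (CSV output) and `/NH` (no header). The helper names `parse_tasklist_mem_mb` and `process_memory_mb` are hypothetical and not part of arrow or base R, and the sketch does not handle the case where `tasklist` finds no matching process.

```r
# Hypothetical helper: pull the "Mem Usage" field (5th CSV column) out of one
# line of `tasklist /FO CSV /NH` output and convert it from KB to MB.
parse_tasklist_mem_mb <- function(csv_line) {
  fields <- read.csv(text = csv_line, header = FALSE,
                     stringsAsFactors = FALSE, colClasses = "character")
  # The field looks like "125,572 K"; drop everything except the digits so the
  # thousands separator (which may vary by locale) does not matter.
  mem_kb <- as.numeric(gsub("[^0-9]", "", fields[[5]]))
  mem_kb / 1024
}

# Hypothetical wrapper (Windows only): ask tasklist for just this R process.
process_memory_mb <- function(pid = Sys.getpid()) {
  out <- system2(
    "tasklist",
    args = c("/FI", shQuote(sprintf("PID eq %d", pid), type = "cmd"),
             "/FO", "CSV", "/NH"),
    stdout = TRUE
  )
  parse_tasklist_mem_mb(out[1])
}
```

With something like this, the `tasklist` line inside `print_memory()` could be swapped for `sprintf("Process: %.0f MB", process_memory_mb())`, which makes the three figures easier to compare side by side.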
<pre class="r"><code>library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
print_memory <- function() {
  print(sprintf("Arrow: %s MB",
                trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
  print(sprintf("R: %s MB", gc()["Vcells", 2]))
  print((function(t) t[grep(Sys.getpid(), t)])(system2("tasklist",
                                                        stdout = TRUE)))
}
# Create example data
size <- 1E8
print_memory()
#> [1] "Arrow: 0 MB"
#> [1] "R: 9.8 MB"
#> [1] "Rterm.exe 7704 Console 2 125,572 K"
my_table <- arrow_table(
  x = Array$create(sample(letters, size, replace = TRUE)),
  y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
  z = Array$create(as.Date(1:size, as.Date("2020-01-01"))),
  a = Array$create(1:size, type = int32())
)
arrow::write_arrow(my_table, "file.arrow5")
#> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
remove(my_table)
# Note: you may need to wait a few seconds for Arrow memory pool to free memory
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 953 MB"
#> [1] "R: 392.6 MB"
#> [1] "Rterm.exe 7704 Console 2 562,728 K"
options(arrow.use_threads=FALSE);
arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
table <- arrow::read_arrow('file.arrow5')
#> Warning: Use 'read_ipc_stream' or 'read_feather' instead.
print_memory()
#> [1] "Arrow: 1335 MB"
#> [1] "R: 1156.2 MB"
#> [1] "Rterm.exe 7704 Console 2 2,709,192 K"
remove(table)
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 858 MB"
#> [1] "R: 11.8 MB"
#> [1] "Rterm.exe 7704 Console 2 1,533,436 K"</code></pre>
Created on 2022-02-22 by the reprex package (v2.0.1), https://reprex.tidyverse.org
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
<pre class="r"><code>sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.5 (2021-03-31)
#> os Windows 10 x64 (build 19042)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2022-02-22
#> pandoc 2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#> ! package * version date (UTC) lib source
#> arrow * 7.0.0 2022-02-10 [1] CRAN (R 4.0.5)
#> P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.0.5)
#> P backports 1.4.1 2021-12-13 [?] CRAN (R 4.0.5)
#> P bit 4.0.4 2020-08-04 [?] CRAN (R 4.0.5)
#> P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.0.5)
#> P cli 3.2.0 2022-02-14 [?] CRAN (R 4.0.5)
#> P crayon 1.5.0 2022-02-14 [?] CRAN (R 4.0.5)
#> P digest 0.6.29 2021-12-01 [?] CRAN (R 4.0.5)
#> P ellipsis 0.3.2 2021-04-29 [?] CRAN (R 4.0.5)
#> P evaluate 0.14 2019-05-28 [?] CRAN (R 4.0.5)
#> P fansi 1.0.2 2022-01-14 [?] CRAN (R 4.0.5)
#> P fastmap 1.1.0 2021-01-25 [?] CRAN (R 4.0.5)
#> P fs 1.5.2 2021-12-08 [?] CRAN (R 4.0.5)
#> P glue 1.6.1 2022-01-22 [?] CRAN (R 4.0.5)
#> P highr 0.9 2021-04-16 [?] CRAN (R 4.0.5)
#> P htmltools 0.5.2 2021-08-25 [?] CRAN (R 4.0.5)
#> P knitr 1.37 2021-12-16 [?] CRAN (R 4.0.5)
#> P lifecycle 1.0.1 2021-09-24 [?] CRAN (R 4.0.5)
#> P magrittr 2.0.2 2022-01-26 [?] CRAN (R 4.0.5)
#> P pillar 1.7.0 2022-02-01 [?] CRAN (R 4.0.5)
#> P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.0.5)
#> P purrr 0.3.4 2020-04-17 [?] CRAN (R 4.0.5)
#> R.cache 0.15.0 2021-04-30 [2] CRAN (R 4.0.5)
#> R.methodsS3 1.8.1 2020-08-26 [2] CRAN (R 4.0.3)
#> R.oo 1.24.0 2020-08-26 [2] CRAN (R 4.0.3)
#> R.utils 2.11.0 2021-09-26 [2] CRAN (R 4.0.5)
#> P R6 2.5.1 2021-08-19 [?] CRAN (R 4.0.5)
#> P reprex 2.0.1 2021-08-05 [?] CRAN (R 4.0.5)
#> P rlang 1.0.1 2022-02-03 [?] CRAN (R 4.0.5)
#> P rmarkdown 2.11 2021-09-14 [?] CRAN (R 4.0.5)
#> P rstudioapi 0.13 2020-11-12 [?] CRAN (R 4.0.5)
#> P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.0.5)
#> P stringi 1.7.6 2021-11-29 [?] CRAN (R 4.0.5)
#> P stringr 1.4.0 2019-02-10 [?] CRAN (R 4.0.5)
#> styler 1.6.2 2021-09-23 [2] CRAN (R 4.0.5)
#> P tibble 3.1.6 2021-11-07 [?] CRAN (R 4.0.5)
#> P tidyselect 1.1.2 2022-02-21 [?] CRAN (R 4.0.5)
#> P utf8 1.2.2 2021-07-24 [?] CRAN (R 4.0.5)
#> P vctrs 0.3.8 2021-04-29 [?] CRAN (R 4.0.5)
#> P withr 2.4.3 2021-11-30 [?] CRAN (R 4.0.5)
#> P xfun 0.29 2021-12-14 [?] CRAN (R 4.0.5)
#> P yaml 2.3.5 2022-02-21 [?] CRAN (R 4.0.5)
#>
#> [1] C:/Users/jalsal02/R/renv/library/arrow-nightly-d7265b80/R-4.0/x86_64-w64-mingw32
#> [2] C:/Users/jalsal02/R/dev-library/4.0
#> [3] C:/Program Files/R/R-4.0.5/library
#>
#> P -- Loaded and on-disk path mismatch.
#>
#> ------------------------------------------------------------------------------</code></pre>
</details>
> [R] Memory usage in R blows up
> ------------------------------
>
> Key: ARROW-15730
> URL: https://issues.apache.org/jira/browse/ARROW-15730
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Christian
> Assignee: Will Jones
> Priority: Major
> Fix For: 6.0.1
>
> Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10gb arrow file into R (under Windows)
> _(The file is generated in the 6.0.1 arrow version under Linux)._
> For whatever reason the memory usage blows up to ~110-120gb (in a fresh and
> empty R instance).
> The weird thing is that when deleting the object again and running a gc() the
> memory usage goes down to 90gb only. The delta of ~20-30gb is what I would
> have expected the dataframe to use up in memory (and that's also approx. what
> was used - in total during the load - when running the old arrow version of
> 0.15.1. And it is also what R shows me when just printing the object size.)
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen in Linux. There all the resources are freed
> up when calling the gc() function - not sure if it matters but there I also
> don't need to set the cpu count to 1.
> Any help would be appreciated.