[ 
https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496338#comment-17496338
 ] 

Jameel Alsalam commented on ARROW-15730:
----------------------------------------

Trying again with html output to see if it renders better:

Hello, I think I have reproduced the issue here. About 1.5 GB appears to still 
be in use after the remove statement. I am on CRAN arrow 7.0.0. I was 
interested in this issue because I have been trying to diagnose a different 
arrow memory issue involving write_dataset. In my investigations, the memory 
reported internally by gc() or arrow is quite different from what Windows 
reports via, e.g., Task Manager. I have found a way to get the Task 
Manager-like figure from within R by running 
`system2("tasklist", stdout=TRUE)` and then filtering for the right process. 
Below is your script, rerun with the additional memory info.

 


 

<pre class="r"><code>library(arrow)
#&gt; 
#&gt; Attaching package: &#39;arrow&#39;
#&gt; The following object is masked from &#39;package:utils&#39;:
#&gt; 
#&gt;     timestamp

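# Report memory as seen by Arrow's memory pool, by R's gc(), and by Windows.
print_memory &lt;- function() {
  print(sprintf(&quot;Arrow: %s MB&quot;,
                trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
  # Column 2 of gc() is the memory currently used by vector cells, in MB
  print(sprintf(&quot;R: %s MB&quot;, gc()[&quot;Vcells&quot;, 2]))
  # Keep only the tasklist line(s) containing the process ID of this R session
  tasklist &lt;- system2(&quot;tasklist&quot;, stdout = TRUE)
  print(tasklist[grep(Sys.getpid(), tasklist)])
}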

# Create example data
size &lt;- 1E8

print_memory()
#&gt; [1] &quot;Arrow: 0 MB&quot;
#&gt; [1] &quot;R: 9.8 MB&quot;
#&gt; [1] &quot;Rterm.exe                     7704 Console                    2    125,572 K&quot;

my_table &lt;- arrow_table(
  x = Array$create(sample(letters, size, replace = TRUE)),
  y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
  z = Array$create(as.Date(1:size, as.Date(&quot;2020-01-01&quot;))),
  a = Array$create(1:size, type=int32())
)

arrow::write_arrow(my_table, &quot;file.arrow5&quot;)
#&gt; Warning: Use &#39;write_ipc_stream&#39; or &#39;write_feather&#39; instead.
remove(my_table)

# Note: you may need to wait a few seconds for Arrow memory pool to free memory
Sys.sleep(5)
print_memory()
#&gt; [1] &quot;Arrow: 953 MB&quot;
#&gt; [1] &quot;R: 392.6 MB&quot;
#&gt; [1] &quot;Rterm.exe                     7704 Console                    2    562,728 K&quot;


options(arrow.use_threads=FALSE);

arrow::set_cpu_count(1); # need this - otherwise it freezes under windows

table &lt;- arrow::read_arrow(&#39;file.arrow5&#39;)
#&gt; Warning: Use &#39;read_ipc_stream&#39; or &#39;read_feather&#39; instead.
print_memory()
#&gt; [1] &quot;Arrow: 1335 MB&quot;
#&gt; [1] &quot;R: 1156.2 MB&quot;
#&gt; [1] &quot;Rterm.exe                     7704 Console                    2  2,709,192 K&quot;

remove(table)
Sys.sleep(5)
print_memory()
#&gt; [1] &quot;Arrow: 858 MB&quot;
#&gt; [1] &quot;R: 11.8 MB&quot;
#&gt; [1] &quot;Rterm.exe                     7704 Console                    2  1,533,436 K&quot;</code></pre>
<p><sup>Created on 2022-02-22 by the <a href="https://reprex.tidyverse.org">reprex package</a> (v2.0.1)</sup></p>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
<pre class="r"><code>sessioninfo::session_info()
#&gt; - Session info ---------------------------------------------------------------
#&gt;  setting  value
#&gt;  version  R version 4.0.5 (2021-03-31)
#&gt;  os       Windows 10 x64 (build 19042)
#&gt;  system   x86_64, mingw32
#&gt;  ui       RTerm
#&gt;  language (EN)
#&gt;  collate  English_United States.1252
#&gt;  ctype    English_United States.1252
#&gt;  tz       America/New_York
#&gt;  date     2022-02-22
#&gt;  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#&gt; 
#&gt; - Packages -------------------------------------------------------------------
#&gt;  ! package     * version date (UTC) lib source
#&gt;    arrow       * 7.0.0   2022-02-10 [1] CRAN (R 4.0.5)
#&gt;  P assertthat    0.2.1   2019-03-21 [?] CRAN (R 4.0.5)
#&gt;  P backports     1.4.1   2021-12-13 [?] CRAN (R 4.0.5)
#&gt;  P bit           4.0.4   2020-08-04 [?] CRAN (R 4.0.5)
#&gt;  P bit64         4.0.5   2020-08-30 [?] CRAN (R 4.0.5)
#&gt;  P cli           3.2.0   2022-02-14 [?] CRAN (R 4.0.5)
#&gt;  P crayon        1.5.0   2022-02-14 [?] CRAN (R 4.0.5)
#&gt;  P digest        0.6.29  2021-12-01 [?] CRAN (R 4.0.5)
#&gt;  P ellipsis      0.3.2   2021-04-29 [?] CRAN (R 4.0.5)
#&gt;  P evaluate      0.14    2019-05-28 [?] CRAN (R 4.0.5)
#&gt;  P fansi         1.0.2   2022-01-14 [?] CRAN (R 4.0.5)
#&gt;  P fastmap       1.1.0   2021-01-25 [?] CRAN (R 4.0.5)
#&gt;  P fs            1.5.2   2021-12-08 [?] CRAN (R 4.0.5)
#&gt;  P glue          1.6.1   2022-01-22 [?] CRAN (R 4.0.5)
#&gt;  P highr         0.9     2021-04-16 [?] CRAN (R 4.0.5)
#&gt;  P htmltools     0.5.2   2021-08-25 [?] CRAN (R 4.0.5)
#&gt;  P knitr         1.37    2021-12-16 [?] CRAN (R 4.0.5)
#&gt;  P lifecycle     1.0.1   2021-09-24 [?] CRAN (R 4.0.5)
#&gt;  P magrittr      2.0.2   2022-01-26 [?] CRAN (R 4.0.5)
#&gt;  P pillar        1.7.0   2022-02-01 [?] CRAN (R 4.0.5)
#&gt;  P pkgconfig     2.0.3   2019-09-22 [?] CRAN (R 4.0.5)
#&gt;  P purrr         0.3.4   2020-04-17 [?] CRAN (R 4.0.5)
#&gt;    R.cache       0.15.0  2021-04-30 [2] CRAN (R 4.0.5)
#&gt;    R.methodsS3   1.8.1   2020-08-26 [2] CRAN (R 4.0.3)
#&gt;    R.oo          1.24.0  2020-08-26 [2] CRAN (R 4.0.3)
#&gt;    R.utils       2.11.0  2021-09-26 [2] CRAN (R 4.0.5)
#&gt;  P R6            2.5.1   2021-08-19 [?] CRAN (R 4.0.5)
#&gt;  P reprex        2.0.1   2021-08-05 [?] CRAN (R 4.0.5)
#&gt;  P rlang         1.0.1   2022-02-03 [?] CRAN (R 4.0.5)
#&gt;  P rmarkdown     2.11    2021-09-14 [?] CRAN (R 4.0.5)
#&gt;  P rstudioapi    0.13    2020-11-12 [?] CRAN (R 4.0.5)
#&gt;  P sessioninfo   1.2.2   2021-12-06 [?] CRAN (R 4.0.5)
#&gt;  P stringi       1.7.6   2021-11-29 [?] CRAN (R 4.0.5)
#&gt;  P stringr       1.4.0   2019-02-10 [?] CRAN (R 4.0.5)
#&gt;    styler        1.6.2   2021-09-23 [2] CRAN (R 4.0.5)
#&gt;  P tibble        3.1.6   2021-11-07 [?] CRAN (R 4.0.5)
#&gt;  P tidyselect    1.1.2   2022-02-21 [?] CRAN (R 4.0.5)
#&gt;  P utf8          1.2.2   2021-07-24 [?] CRAN (R 4.0.5)
#&gt;  P vctrs         0.3.8   2021-04-29 [?] CRAN (R 4.0.5)
#&gt;  P withr         2.4.3   2021-11-30 [?] CRAN (R 4.0.5)
#&gt;  P xfun          0.29    2021-12-14 [?] CRAN (R 4.0.5)
#&gt;  P yaml          2.3.5   2022-02-21 [?] CRAN (R 4.0.5)
#&gt; 
#&gt;  [1] C:/Users/jalsal02/R/renv/library/arrow-nightly-d7265b80/R-4.0/x86_64-w64-mingw32
#&gt;  [2] C:/Users/jalsal02/R/dev-library/4.0
#&gt;  [3] C:/Program Files/R/R-4.0.5/library
#&gt; 
#&gt;  P -- Loaded and on-disk path mismatch.
#&gt; 
#&gt; ------------------------------------------------------------------------------</code></pre>
</details>
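
One caveat with the tasklist step above: `grep(Sys.getpid(), t)` matches the PID digits anywhere in a line, so it could in principle also pick up another process whose PID or memory figure happens to contain the same digits. A possible alternative is to let tasklist do the filtering and return a plain number. This is only a rough sketch, not tested against the session above: the helper names `mem_field_to_mb` and `process_memory_mb` are made up for illustration, and it assumes the memory figure is the fifth field of tasklist's CSV output.

```r
# Hypothetical helper: turn a tasklist memory field such as "125,572 K" into MB.
# Every non-digit is dropped, so the thousands separator does not matter.
mem_field_to_mb <- function(field) {
  kb <- as.numeric(gsub("[^0-9]", "", field))
  kb / 1024
}

# Hypothetical helper (Windows only): ask tasklist for a single process.
# /FI filters by PID, /FO CSV selects CSV output, /NH drops the header row.
process_memory_mb <- function(pid = Sys.getpid()) {
  stopifnot(.Platform$OS.type == "windows")
  out <- system2(
    "tasklist",
    args = c("/FI", shQuote(paste("PID eq", pid)), "/FO", "CSV", "/NH"),
    stdout = TRUE
  )
  fields <- read.csv(text = out, header = FALSE, stringsAsFactors = FALSE)
  # Assumption: the fifth CSV field is the memory usage, e.g. "125,572 K"
  mem_field_to_mb(fields[[5]])
}
```

With something like this, `print_memory()` could print `process_memory_mb()` instead of the raw tasklist line, which makes the before/after numbers easier to compare.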

> [R] Memory usage in R blows up
> ------------------------------
>
>                 Key: ARROW-15730
>                 URL: https://issues.apache.org/jira/browse/ARROW-15730
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Christian
>            Assignee: Will Jones
>            Priority: Major
>             Fix For: 6.0.1
>
>         Attachments: image-2022-02-19-09-05-32-278.png
>
>
> Hi,
> I'm trying to load a ~10gb arrow file into R (under Windows)
> _(The file is generated in the 6.0.1 arrow version under Linux)._
> For whatever reason the memory usage blows up to ~110-120gb (in a fresh and 
> empty R instance).
> The weird thing is that when deleting the object again and running a gc() the 
> memory usage goes down to 90gb only. The delta of ~20-30gb is what I would 
> have expected the dataframe to use up in memory (and that's also approx. what 
> was used - in total during the load - when running the old arrow version of 
> 0.15.1. And it is also what R shows me when just printing the object size.)
> The commands I'm running are simply:
> options(arrow.use_threads=FALSE);
> arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
> arrow::read_arrow('file.arrow5')
> Is arrow reserving some resources in the background and not giving them up 
> again? Are there some settings I need to change for this?
> Is this something that is known and fixed in a newer version?
> *Note* that this doesn't happen in Linux. There all the resources are freed 
> up when calling the gc() function - not sure if it matters but there I also 
> don't need to set the cpu count to 1.
> Any help would be appreciated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
