[jira] [Created] (ARROW-16452) [R] After dataset scan, some RAM is left consumed until a garbage collection pass

Weston Pace (Jira) Tue, 03 May 2022 19:59:05 -0700

Weston Pace created ARROW-16452:
-----------------------------------

             Summary: [R] After dataset scan, some RAM is left consumed until a 
garbage collection pass
                 Key: ARROW-16452
                 URL: https://issues.apache.org/jira/browse/ARROW-16452
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Weston Pace



This might be "not a bug", but I wonder if we can do something better here.  
When I create and execute a dplyr query, a substantial amount of RAM is left 
allocated until the next GC pass.

Since R's garbage collection is triggered only by memory that R itself has 
allocated, this extra memory (which can be quite substantial) might never be 
freed.

Perhaps we should just manually trigger a gc pass after running an execution 
plan?  Or it may be good to get a better understanding of what exactly this 
memory is being used for.

In the example below, I load ~2GB of data, but after the collect ~3GB is in 
use.  I wait 10 seconds to rule out jemalloc's delayed release of memory.  
Then I run {{gc()}} manually and ~1GB is freed.

{noformat}
> dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> default_memory_pool()$bytes_allocated
[1] 64
> x <- dataset %>% collect(as_data_frame=FALSE)
> arrow::default_memory_pool()$bytes_allocated
> Sys.sleep(10)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  917099 49.0    1498168 80.1  1498168 80.1
Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> arrow::default_memory_pool()$bytes_allocated
[1] 2028716480
{noformat}
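
As a rough illustration of the "manually trigger a gc pass" idea, a wrapper 
along these lines could be used on the user side.  This is only a sketch: 
{{collect_then_gc()}} is a hypothetical helper name, not part of arrow, and 
the temporary dataset stands in for a real one.  It assumes the arrow and 
dplyr packages are installed and that arrow was built with dataset and 
Parquet support.

```r
# Sketch only: collect_then_gc() is a hypothetical helper, not an arrow API.
library(arrow)
library(dplyr)

collect_then_gc <- function(query, ...) {
  # Run the query / execution plan and keep the result.
  result <- dplyr::collect(query, ...)
  # Force an R garbage collection pass right away. Per the report above,
  # this is what releases the extra Arrow memory left over after the scan.
  gc(verbose = FALSE)
  result
}

# Hypothetical usage with a small throwaway dataset in a temp directory.
path <- tempfile()
write_dataset(data.frame(x = 1:1000, y = rnorm(1000)), path)
dataset <- open_dataset(path)

tbl <- collect_then_gc(dataset, as_data_frame = FALSE)

# Bytes still held by the default pool after the forced GC pass.
default_memory_pool()$bytes_allocated
```

This does not explain what the extra memory is used for.  It only moves the 
{{gc()}} call from the user's console into the collect step, at the cost of 
a full GC pass on every query.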



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
