[
https://issues.apache.org/jira/browse/ARROW-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-16452:
--------------------------------
Description:
This might be "not a bug" but I wonder if we can do something better here.
When I create and execute a dplyr query there is a bunch of RAM that is left
allocated until the next GC pass.
Since R's garbage collection is triggered only by memory that R itself has
allocated, this extra memory (which can be quite substantial) might never be freed.
Perhaps we should just manually trigger a gc pass after running an execution
plan? Or it may be good to get a better understanding of what exactly this
memory is being used for.
In the example below I load ~2GB of data, but after the collect ~3GB is
allocated. I wait 10 seconds to rule out jemalloc as the cause. Then I run
{{gc()}} manually and ~1GB is freed.
{noformat}
> dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> default_memory_pool()$bytes_allocated
[1] 64
> x <- dataset %>% collect(as_data_frame=FALSE)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> Sys.sleep(10)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  917099 49.0    1498168 80.1  1498168 80.1
Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> arrow::default_memory_pool()$bytes_allocated
[1] 2028716480
{noformat}
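For illustration, the "manually trigger a gc pass" idea could be sketched as a small wrapper around {{collect()}}. This is only a sketch under assumptions: {{collect_then_gc}} is a hypothetical helper, not an arrow function, and the temporary dataset written here merely stands in for the real one. Whether forcing a GC is the right fix is still the open question.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): run the query, then force an R
# garbage collection so that Arrow buffers referenced only by unreachable
# R objects can be released back to the memory pool.
collect_then_gc <- function(query, ...) {
  result <- collect(query, ...)
  gc(verbose = FALSE)
  result
}

# Stand-in dataset written to a temporary directory (assumption: any
# dataset directory readable by open_dataset() would do).
path <- file.path(tempdir(), "gc_sketch_dataset")
write_dataset(data.frame(id = 1:1000, value = rnorm(1000)), path)

dataset <- open_dataset(path)
x <- collect_then_gc(dataset, as_data_frame = FALSE)

# Bytes still held by the default pool after the forced GC.
print(default_memory_pool()$bytes_allocated)
```

Comparing {{bytes_allocated}} with and without the wrapper on a large dataset would show how much of the extra memory is only waiting on R's collector.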
was:
This might be "not a bug" but I wonder if we can do something better here.
When I create and execute a dplyr query there is a bunch of RAM that is left
allocated until the next GC pass.
Since R's garbage collection is only based on RAM that R has allocated this
extra memory (which can be quite substantial) might never be freed.
Perhaps we should just manually trigger a gc pass after running an execution
plan? Or it may be good to get a better understanding of what exactly this
memory is being used for.
In the example below I load ~2GB of data but after the collect there is ~3GB
used. I wait 10 seconds to ensure it's not jemalloc. Then I run {{gc()}}
manually and ~1GB is freed.
{noformat}
> dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> default_memory_pool()$bytes_allocated
[1] 64
> x <- dataset %>% collect(as_data_frame=FALSE)
> arrow::default_memory_pool()$bytes_allocated
> Sys.sleep(10)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  917099 49.0    1498168 80.1  1498168 80.1
Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> arrow::default_memory_pool()$bytes_allocated
[1] 2028716480
{noformat}
> [R] After dataset scan, some RAM is left consumed until a garbage collection
> pass
> ---------------------------------------------------------------------------------
>
> Key: ARROW-16452
> URL: https://issues.apache.org/jira/browse/ARROW-16452
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Weston Pace
> Priority: Major