[ 
https://issues.apache.org/jira/browse/ARROW-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

András Svraka updated ARROW-10080:
----------------------------------
    Description: 
I’m having problems when {{collect()}}-ing Arrow data sources into data frames 
that are close in size to the available memory on the machine. Consider the 
following workflow. I have a dataset which I want to query so that at some 
point it needs to be {{collect()}}-ed, but at the same time I’m also reducing 
the result. During the intermediate step the entire data frame fits into 
memory, and the following code runs without any problems.
{code:r}
library(arrow)
library(dplyr)

test_ds <- "memory_test"

ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}
However, running the same code in the same R session again fails with R running 
out of memory.
{code:r}
ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}
The example might be a bit contrived, but you can easily imagine a workflow 
where different queries are run on a dataset and only the reduced results are 
stored.

As far as I understand, R is a garbage-collected language, and in this case 
there aren’t any references left to large objects in memory. And indeed, the 
second query succeeds when I manually force a garbage collection.
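To illustrate, the workaround amounts to something like the following sketch (in my actual gist the {{gc()}} call is toggled by a command line argument; here it is unconditional):
{code:r}
library(arrow)
library(dplyr)

test_ds <- "memory_test"

# First query: collect the whole dataset, reduce it, keep only the result.
ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()

# No R-level references to the large intermediate data frame remain here,
# yet without this explicit gc() the second collect() runs out of memory.
gc()

# Second query: succeeds only after the manual garbage collection.
ds2 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}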

Is this the expected behaviour from Arrow?

I know this is quite hard to reproduce, as the exact dataset size required to 
trigger this behaviour depends on the particular machine, but I prepared a 
reproducible example in [this 
gist|https://gist.github.com/svraka/c63fca51c6cc50020551e2319ff652b7], which 
should give the same result on Ubuntu 20.04 with 1 GB RAM and no swap. See the 
attachment for {{sessionInfo()}} output. I ran it on a DigitalOcean 
{{s-1vcpu-1gb}} droplet.

First, let’s create a partitioned Arrow dataset:
{code:bash}
$ Rscript ds_prep.R 1000000 5
{code}
The first command line argument gives the number of rows in each partition, 
and the second gives the number of partitions. The parameters are set so that 
the entire dataset should fit into memory.
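The actual preparation script is in the linked gist; a minimal sketch with the same command line interface might look roughly like this (the column names here are hypothetical):
{code:r}
# Hypothetical sketch of ds_prep.R; see the linked gist for the real script.
library(arrow)

args    <- commandArgs(trailingOnly = TRUE)
n_rows  <- as.integer(args[1])  # rows per partition
n_parts <- as.integer(args[2])  # number of partitions

df <- data.frame(
  part = rep(seq_len(n_parts), each = n_rows),
  x    = rnorm(n_rows * n_parts)
)

# Write a Hive-style partitioned dataset into the "memory_test" directory,
# one subdirectory per value of the "part" column.
write_dataset(df, "memory_test", partitioning = "part")
{code}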

Then running the two queries fails:
{code:bash}
$ Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
[1]    11151 killed     Rscript ds_read.R
{code}
However, when forcing a {{gc()}} (which I’m controlling here with a command 
line argument), it succeeds:
{code:bash}
$ Rscript ds_read.R 1
Running query, 1st try...
ds size, 1st run: 56
running gc() ...
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  703052 37.6    1571691  84.0  1038494  55.5
Vcells 1179578  9.0   36405636 277.8 41188956 314.3
Running query, 2nd try...
ds size, 2nd run: 56
{code}
In general, [one shouldn’t have to call {{gc()}} 
manually|https://adv-r.hadley.nz/names-values.html#gc]. Interestingly, making 
R’s garbage collection more aggressive (see {{?Memory}}) doesn’t help either:
{code:bash}
$ R_GC_MEM_GROW=0 Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
[1]    11422 killed     Rscript ds_read.R
{code}
I didn’t try to reproduce this problem on macOS, as my Mac would probably 
start swapping furiously, but I managed to reproduce it on a Windows 7 machine 
with practically no swap. Of course, the parameters are different, and the 
error messages are presumably system-specific.
{code:bash}
$ Rscript ds_prep.R 1000000 40
$ Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
Error in dataset___Scanner__ToTable(self) :
  IOError: Out of memory: malloc of size 524288 failed
Calls: collect ... shared_ptr -> shared_ptr_is_null -> 
dataset___Scanner__ToTable
Execution halted
$ Rscript ds_read.R 1
Running query, 1st try...
ds size, 1st run: 56
running gc() ...
          used (Mb) gc trigger   (Mb)  max used (Mb)
Ncells  688789 36.8    1198030   64.0   1198030   64
Vcells 1109451  8.5  271538343 2071.7 321118845 2450
Running query, 2nd try...
ds size, 2nd run: 56
$ R_GC_MEM_GROW=0 Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
Error in dataset___Scanner__ToTable(self) :
  IOError: Out of memory: malloc of size 524288 failed
Calls: collect ... shared_ptr -> shared_ptr_is_null -> 
dataset___Scanner__ToTable
Execution halted
{code}



> Arrow does not release unused memory
> ------------------------------------
>
>                 Key: ARROW-10080
>                 URL: https://issues.apache.org/jira/browse/ARROW-10080
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>         Environment: Linux, Windows
>            Reporter: András Svraka
>            Priority: Major
>         Attachments: sessioninfo.txt
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
