Lucas Mation created ARROW-18176:
------------------------------------

             Summary: [R] arrow::open_dataset %>% select(myvars) %>% collect 
causes memory leak
                 Key: ARROW-18176
                 URL: https://issues.apache.org/jira/browse/ARROW-18176
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Lucas Mation


I first posted this on StackOverflow, 
[here|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak].

 

I am having trouble using arrow in R. First, I saved some {{data.tables}} that 
were about 50-60 GB in memory ({{d}} in the code chunk below) to a Parquet 
dataset using:
 
{{d %>% write_dataset(f, format='parquet')  # f is the directory name}}
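
For reference, here is a self-contained toy version of the write step (the real {{d}} is 50-60 GB and confidential, so the table below is purely illustrative; {{f}} is just a scratch directory):

{code:r}
# Toy stand-in for the real 50-60 GB data.table (illustrative only)
library(arrow)
library(data.table)
library(dplyr)

d <- data.table(id = 1:1e6,
                x  = rnorm(1e6),
                y  = sample(letters, 1e6, replace = TRUE))
f <- "parquet_dir"  # directory that write_dataset() will create and fill

d %>% write_dataset(f, format = "parquet")
{code}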

Then I try to open the dataset, select the relevant variables and collect the 
result:
 
{{tic()}}
{{d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()  # myvars is a vector of variable names}}
{{toc()}}
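
The same step as a runnable toy sketch (assuming the {{f}} and toy columns from the chunk above; {{myvars}} here is just an illustrative subset):

{code:r}
library(arrow)
library(dplyr)
library(tictoc)

myvars <- c("id", "x")  # illustrative column subset

tic()
d2 <- open_dataset(f) %>%
  select(all_of(myvars)) %>%
  collect()
toc()

format(object.size(d2), units = "GB")  # size of the collected result in R
{code}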

I did this conversion for 3 sets of data.tables (unfortunately, the data is 
confidential, so I can't include it in the example). For one set, I was able to 
{{open>select>collect}} the desired table in about 60 s, obtaining a roughly 
10 GB result (after variable selection).

For the other two sets, the command caused what looks like a memory leak. 
{{tic()}}-{{toc()}} returned after about 80 s, but the object name ({{d2}}) 
never appeared in RStudio's "Environment" panel, memory usage kept creeping up 
until it occupied most of the server's available RAM, and then R crashed. Note 
that the original dataset, without subsetting columns, was smaller than 60 GB 
and the server has 512 GB of RAM.
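
For what it's worth, here is roughly how one could watch Arrow's own allocator while the collect runs, and try the system allocator instead of jemalloc/mimalloc (a diagnostic sketch based on the documented {{ARROW_DEFAULT_MEMORY_POOL}} environment variable, not a confirmed fix):

{code:r}
# Must be set before library(arrow) is loaded for the allocator switch to apply
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
library(arrow)

pool <- default_memory_pool()
pool$backend_name     # which allocator Arrow is using
pool$bytes_allocated  # bytes currently held by Arrow's memory pool
pool$max_memory       # peak allocation seen so far
{code}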

Any ideas on what could be going on here?
