[jira] [Created] (ARROW-14727) Excessive memory usage on Windows

Jira Tue, 16 Nov 2021 09:19:06 -0800

András Svraka created ARROW-14727:
-------------------------------------

             Summary: Excessive memory usage on Windows
                 Key: ARROW-14727
                 URL: https://issues.apache.org/jira/browse/ARROW-14727
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 6.0.0
            Reporter: András Svraka



I have the following workflow which worked on Arrow 5.0 on Windows 10 and R 
4.1.2:
{code:r}
open_dataset(path) %>%
  select(i, j) %>%
  collect()
{code}
The dataset in {{path}} is partitioned by {{i}} and {{j}}, with 16 
partitions in total, 5 million rows in each partition and each partition has 
several other regular columns (i.e. present in every partition). The entire 
dataset can be read into memory on my 16GB machine, which results in an R 
data.frame of around 3GB. However, on Arrow 6.0 the same operation fails, and R 
runs out of memory. Interestingly, this still works:
{code:r}
open_dataset(path) %>%
  select(i, j, x) %>%
  collect()
{code}
where {{x}} is a regular column.
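
For anyone trying to reproduce this, a dataset with a similar layout might be generated roughly as follows. This is only a sketch: the 4 x 4 split of the 16 partitions, the extra column {{y}}, the file name {{part-0.parquet}} and the reduced row count are assumptions, not details taken from the report.
{code:r}
library(arrow)

# Hypothetical reconstruction of the layout: 4 x 4 = 16 Hive-style
# partitions on i and j, each holding the regular columns x and y.
path <- file.path(tempdir(), "arrow-14727-example")
n <- 1000  # rows per partition; the report describes 5 million

for (i in 1:4) {
  for (j in 1:4) {
    part_dir <- file.path(path, paste0("i=", i), paste0("j=", j))
    dir.create(part_dir, recursive = TRUE, showWarnings = FALSE)
    write_parquet(
      data.frame(x = runif(n), y = sample(letters, n, replace = TRUE)),
      file.path(part_dir, "part-0.parquet")
    )
  }
}

# open_dataset() picks up the i=.../j=... directories as partition columns.
ds <- open_dataset(path)
{code}
Raising {{n}} towards 5 million should bring the sketch closer to the sizes described above.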

I cannot reproduce the same issue on Linux. Measuring the actual memory 
consumption with GNU time ({{--format=%Mmax}}) I get very similar figures 
for the first pipeline both on 5.0 and 6.0. The same is true for the second 
pipeline, which of course consumes slightly more memory, as expected. On 
Windows I don’t know of a simple method to measure maximum memory consumption 
but eyeballing it from Process Explorer, Arrow 5.0 needs around 0.5GB for the 
first example, while with Arrow 6.0 my 16GB machine becomes unresponsive, 
starts swapping, and depending on the circumstances, other apps might crash 
before R crashes with this error:
{noformat}
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
{noformat}
With the second example, both versions consume roughly the same amount of 
memory.
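
Since Process Explorer only gives a rough reading, one possible cross-check is the counter kept by Arrow's own memory pool. The sketch below is hedged: {{arrow_peak}} is a hypothetical helper, and it only sees allocations that go through Arrow's default memory pool, so its numbers may not match what the operating system reports.
{code:r}
library(arrow)

# Hypothetical helper: evaluate `expr`, then report what Arrow's default
# memory pool has recorded. max_memory is the peak since the R session
# started, not just for this call, so use a fresh session per measurement.
arrow_peak <- function(expr) {
  pool <- default_memory_pool()
  force(expr)
  list(
    backend = pool$backend_name,
    current_bytes = pool$bytes_allocated,
    peak_bytes = pool$max_memory
  )
}

# Stand-in workload; replace it with the open_dataset() pipeline under test.
stats <- arrow_peak(Table$create(x = runif(1e5)))
print(stats)
{code}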

The following, which uses features new in Arrow 6.0, doesn’t work on Windows 
either; memory consumption shoots up into the tens of GB:
{code:r}
open_dataset(path) %>%
  distinct(i, j) %>%
  collect()
{code}
Meanwhile this works, with under 1GB memory needed:
{code:r}
open_dataset(path) %>%
  distinct(i, j, x) %>%
  collect()
{code}
These last two examples work without any issue on Linux, and as expected, they 
consume significantly less memory than the select-then-collect examples.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
