[
https://issues.apache.org/jira/browse/ARROW-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dragoș Moldovan-Grünfeld updated ARROW-14736:
---------------------------------------------
Description:
Attempting to open a multi-file dataset and write a re-partitioned version of
it fails, apparently because there is an attempt to collect the data into
memory first. This happens for both wide and long data.
Steps to reproduce the issue:
1. Create a large dataset (100k columns, 300k rows in total): generate one
partition, write it to disk, and copy the resulting file 20 times. Each file
has a footprint of roughly 7.5 GB (a rough size estimate follows the code
block).
{code:r}
library(arrow)
library(dplyr)
library(fs)

rows <- 300000
cols <- 100000
partitions <- 20

# Generate a single partition: (rows / partitions) rows by cols columns.
wide_df <- as.data.frame(
  matrix(
    sample(1:32767, rows * cols / partitions, replace = TRUE),
    ncol = cols
  )
)

# Build an all-int16 schema from the column names.
schem <- sapply(colnames(wide_df), function(nm) int16())
schem <- do.call(schema, schem)

wide_tab <- Table$create(wide_df, schema = schem)
write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")

# Copy the single file `partitions` times to build a multi-file dataset.
fs::dir_create("~/Documents/arrow_playground/wide_ds")
for (i in seq_len(partitions)) {
  file.copy(
    "~/Documents/arrow_playground/wide.parquet",
    glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet")
  )
}

ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
{code}
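For scale, a rough back-of-the-envelope estimate (my own sketch; it counts only
the raw int16 payload and ignores Parquet encoding, compression and R overhead)
shows why any code path that materialises the full dataset has to exhaust the
16 GB of RAM on the test machine:
{code:r}
# Lower bound on the in-memory size of the full 300k x 100k int16 table.
rows <- 300000
cols <- 100000
total_bytes <- rows * cols * 2   # 2 bytes per int16 value
total_bytes / 1024^3             # ~55.9 GiB, far beyond 16 GB of RAM
{code}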
All the following steps fail:
2. Creating and writing a partitioned version of {{ds_wide}}.
{code:r}
ds_wide %>%
  mutate(grouper = round(V1 / 1024)) %>%
  write_dataset(
    "~/Documents/arrow_playground/partitioned",
    partitioning = "grouper",
    format = "parquet"
  )
{code}
3. Writing a non-partitioned dataset:
{code:r}
ds_wide %>%
  write_dataset(
    "~/Documents/arrow_playground/partitioned",
    format = "parquet"
  )
{code}
4. Creating the partitioning variable first and then attempting to write:
{code:r}
ds2 <- ds_wide %>%
  mutate(grouper = round(V1 / 1024))

ds2 %>%
  write_dataset(
    "~/Documents/arrow_playground/partitioned",
    partitioning = "grouper",
    format = "parquet"
  )
{code}
5. Attempting to write to CSV:
{code:r}
ds_wide %>%
  write_dataset(
    "~/Documents/arrow_playground/csv_writing/test.csv",
    format = "csv"
  )
{code}
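By contrast, the same open-then-rewrite pipeline completes at a trivial scale.
The snippet below is a sanity check of my own (not part of the original
report); it assumes the libraries loaded in step 1 and is only meant to suggest
that the failures above are about memory use rather than the query shape:
{code:r}
# Tiny stand-in dataset: same pipeline shape, trivially small data.
td <- tempfile()
dir.create(td)
write_parquet(Table$create(data.frame(V1 = 1:1000)),
              file.path(td, "part-0.parquet"))

open_dataset(td) %>%
  mutate(grouper = round(V1 / 100)) %>%
  write_dataset(tempfile(), partitioning = "grouper", format = "parquet")
{code}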
None of the failures appears to originate in R code, and they all show the same
behaviour: the R session consumes increasing amounts of RAM until it crashes.
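One way to narrow down where the memory goes is to watch Arrow's own memory
pool around the failing write. This is a diagnostic sketch, not something from
the report: it assumes {{default_memory_pool()}} and its {{bytes_allocated}} /
{{max_memory}} fields behave as documented, and the final reads only happen if
the session survives long enough to reach them.
{code:r}
# Check Arrow's own allocations around the failing write (fresh session).
pool <- arrow::default_memory_pool()
pool$bytes_allocated    # baseline, should be close to zero

try(
  ds_wide %>%
    write_dataset("~/Documents/arrow_playground/partitioned", format = "parquet")
)

pool$bytes_allocated    # growth here points at buffering inside Arrow's pool
pool$max_memory         # high-water mark for the session
{code}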
> [C++][R] Opening a multi-file dataset and writing a re-partitioned version of
> it fails
> -------------------------------------------------------------------------------------
>
> Key: ARROW-14736
> URL: https://issues.apache.org/jira/browse/ARROW-14736
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 6.0.0
> Environment: M1 Mac, macOS Monterey 12.0.1, 16 GB RAM
> R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Major
> Attachments: image-2021-11-17-14-43-37-127.png,
> image-2021-11-17-14-54-42-747.png, image-2021-11-17-14-55-08-597.png
>
>