[ 
https://issues.apache.org/jira/browse/ARROW-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445433#comment-17445433
 ] 

Weston Pace commented on ARROW-14736:
-------------------------------------

I suspect this is going to be a problem until 
https://issues.apache.org/jira/browse/ARROW-14648 is resolved. By specifying 
backpressure in terms of "# of batches" instead of "# of bytes", the limits 
depend very much on the shape of the input.

 

Furthermore, those limits were designed around the CSV reader (my mistake), 
which yields ~1MB batches. So the readahead limit for CSV works out to about 
256MB, while the readahead limit for parquet (where people often create 1GB 
row groups) is more like 256GB.
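
To make the batch-count arithmetic concrete, here's a rough sketch. The 
256-batch readahead window is inferred from the 256MB figure above rather than 
read out of the source, so treat the exact numbers as an illustration only:

{code:r}
# Back-of-the-envelope readahead arithmetic (the 256-batch window is an
# assumption inferred from the limits described above).
batches_readahead <- 256
csv_batch_bytes   <- 1 * 1024^2   # CSV reader yields ~1MB batches
parquet_rg_bytes  <- 1 * 1024^3   # parquet row groups are often ~1GB

batches_readahead * csv_batch_bytes  / 1024^2   # ~256 (MB) buffered for CSV
batches_readahead * parquet_rg_bytes / 1024^3   # ~256 (GB) buffered for parquet
{code}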

 

Try shaping your data so that row groups are ~10MB and see if that helps with 
memory pressure (I'd expect it to cap out around 3GB). With a dataset this 
wide, however, small row groups are going to hurt performance, so this is 
just a workaround and not a long-term solution.
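
As a rough sketch of that workaround (assuming the reporter's ~100k int16 
columns, so one row is roughly 200KB and ~50 rows per row group lands near 
10MB; the output path is made up for illustration), write_parquet()'s 
chunk_size argument controls how many rows go into each row group:

{code:r}
library(arrow)

# ~100,000 int16 columns -> one row is roughly 100000 * 2 bytes ≈ 200KB,
# so ~50 rows per row group gives row groups of roughly 10MB.
rows_per_group <- 50

write_parquet(
  wide_tab,
  "~/Documents/arrow_playground/wide_small_rg.parquet",  # hypothetical path
  chunk_size = rows_per_group
)
{code}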


Even with ARROW-14648, your memory pressure is going to stay in the red until 
ARROW-14635 is addressed, because a dataset write just hands off RAM to the OS.

> [C++][R] Opening a multi-file dataset and writing a re-partitioned version of 
> it fails
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-14736
>                 URL: https://issues.apache.org/jira/browse/ARROW-14736
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: M1 Mac, macOS Monterey 12.0.1, 16Gb RAM
> R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Major
>         Attachments: image-2021-11-17-14-43-37-127.png, 
> image-2021-11-17-14-54-42-747.png, image-2021-11-17-14-55-08-597.png
>
>
> Attempting to open a multi-file dataset and write a re-partitioned version of 
> it fails, as it seems there is an attempt to collect the data into memory 
> first. This happens for both wide and long data.
> Steps to reproduce the issue:
> 1. Create a large dataset (100k columns, 300k rows), write it to disk, and 
> create 20 copies of it. Each file will have a footprint of roughly 7.5GB. 
> {code:r}
> library(arrow)
> library(dplyr)
> library(fs)
> rows <- 300000
> cols <- 100000
> partitions <- 20
> wide_df <- as.data.frame(
>   matrix(
>     sample(1:32767, rows * cols / partitions, replace = TRUE), 
>     ncol = cols)
> )
> schem <- sapply(colnames(wide_df), function(nm) {int16()})
> schem <- do.call(schema, schem)
> wide_tab <- Table$create(wide_df, schema = schem)
> write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")
> fs::dir_create("~/Documents/arrow_playground/wide_ds")
> for (i in seq_len(partitions)) {
>   file.copy("~/Documents/arrow_playground/wide.parquet", 
>             glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet"))
> }
> ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
> {code}
> All the following steps fail:
> 2. Creating and writing a partitioned version of {{ds_wide}}.
> {code:r}
>   ds_wide %>%
>     mutate(grouper = round(V1 / 1024)) %>%
>     write_dataset("~/Documents/arrow_playground/partitioned", 
>                    partitioning = "grouper",
>                    format = "parquet")
> {code}
> 3. Writing a non-partitioned dataset:
> {code:r}
>   ds_wide %>%
>     write_dataset("~/Documents/arrow_playground/partitioned", 
>                   format = "parquet")
> {code}
> 4. Creating the partitioning variable first and then attempting to write:
> {code:r}
>   ds2 <- ds_wide %>% 
>     mutate(grouper = round(V1 / 1024))
>   ds2 %>% 
>     write_dataset("~/Documents/arrow_playground/partitioned", 
>                   partitioning = "grouper", 
>                   format = "parquet")  
> {code}
> 5. Attempting to write to csv:
> {code:r}
> ds_wide %>% 
>   write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
>                 format = "csv")
> {code}
> None of the failures seem to originate in R code, and they all result in 
> similar behaviour: the R session consumes increasing amounts of RAM until it 
> crashes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
