Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12321:
---------------------------------------------------------

             Summary: [R] Arrow opens too many files at once when writing a dataset
                 Key: ARROW-12321
                 URL: https://issues.apache.org/jira/browse/ARROW-12321
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 3.0.0
            Reporter: Mauricio 'Pachá' Vargas Sepúlveda


_Related to https://issues.apache.org/jira/browse/ARROW-12315_

The raw data and the output are available at
https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing

This works:

{code:java}

library(data.table)
library(dplyr)
library(arrow)

d <- fread(
        input = 
"01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
        colClasses = list(
          character = "Commodity Code",
          numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
        ))

d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )

# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()

d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
{code}

But if I add an additional column for partitioning and increase the max 
partitions to 12808 (exactly the number of partitions that it needs), I 
get this error:

{code:java}
d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)

Error: IOError: Failed to open local file 
'/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
 Detail: [errno 24] Too many open files
{code}
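
A rough way to compare the two calls is to count the distinct combinations of the partition columns and set that against the per-process limit on open files. The sketch below is only an illustration: it assumes the writer keeps one file open per partition (which is what the error suggests, but is not confirmed here), it uses a made-up toy data frame with `reporter` and `partner` standing in for `Reporter ISO` and `Partner ISO`, and both `count_partitions` and the limit of 1024 are hypothetical placeholders.

```r
# Toy stand-in for the trade data; the values are made up for illustration.
d_toy <- data.frame(
  Year = c(2019, 2019, 2019, 2019, 2019),
  reporter = c("SEN", "SEN", "SEN", "CHL", "CHL"),
  partner = c("MOZ", "MOZ", "FRA", "PER", "ARG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct combinations of the partition
# columns, i.e. the number of leaf directories a partitioned write creates.
count_partitions <- function(df, cols) {
  nrow(unique(df[, cols, drop = FALSE]))
}

n_two <- count_partitions(d_toy, c("Year", "reporter"))
n_three <- count_partitions(d_toy, c("Year", "reporter", "partner"))

# Assumed soft limit on open files. 1024 is a common Linux default;
# the real value can be checked with `ulimit -n` in a shell.
open_file_limit <- 1024L
exceeds_limit <- n_three > open_file_limit

print(c(two_columns = n_two, three_columns = n_three))
print(exceeds_limit)
```

On the real data the same count can be obtained with the commented-out `select() %>% distinct() %>% dim()` pipeline above. If the count is well above the open-file limit, the "Too many open files" error would be consistent with all partitions being held open at once.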







