Mauricio 'Pachá' Vargas Sepúlveda created ARROW-12321:
---------------------------------------------------------
Summary: [R] Arrow opens too many files at once when writing a
dataset
Key: ARROW-12321
URL: https://issues.apache.org/jira/browse/ARROW-12321
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 3.0.0
Reporter: Mauricio 'Pachá' Vargas Sepúlveda
_Related to https://issues.apache.org/jira/browse/ARROW-12315_
Please see
https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
where I added the raw data and the output.
This works:
{code:java}
library(data.table)
library(dplyr)
library(arrow)
d <- fread(
  input =
    "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
  colClasses = list(
    character = "Commodity Code",
    numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
  )
)
d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )
# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()
d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
{code}
But if I add an additional column for partitioning and increase the max
partitions to 12808 (exactly the number of partitions it needs), I get this
error:
{code:java}
d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)
Error: IOError: Failed to open local file
'/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
Detail: [errno 24] Too many open files
{code}
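For context, errno 24 is EMFILE: the process has reached its limit on open file descriptors. If the writer keeps one file open per partition, as the error suggests, the partition count can be compared against that limit before writing. A minimal sketch, assuming a Unix-like system where `ulimit -n` is available from the shell; the `toy` data frame and its rows are invented for illustration and stand in for `d`:

```r
library(dplyr)

# Hypothetical stand-in for `d`; the rows are invented for illustration.
toy <- data.frame(
  Year = c(2019L, 2019L, 2019L, 2019L),
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL"),
  `Partner ISO` = c("MOZ", "CHL", "SEN", "SEN"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Each distinct key combination becomes one partition directory.
n_partitions <- toy %>%
  distinct(Year, `Reporter ISO`, `Partner ISO`) %>%
  nrow()

# Soft limit on open file descriptors, read from a child shell that inherits
# it from this R process. Assumes a Unix-like shell; "unlimited" maps to Inf.
raw_limit <- system("ulimit -n", intern = TRUE)
fd_limit <- if (identical(raw_limit, "unlimited")) Inf else as.numeric(raw_limit)

message("partitions: ", n_partitions, ", open-file limit: ", fd_limit)
if (n_partitions >= fd_limit) {
  message("At least as many partitions as file descriptors; the write may fail with errno 24.")
}
```

On the real data, replacing `toy` with `d` should report the 12808 partitions mentioned above, which can then be compared with the limit on the machine in question.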