[
https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mauricio 'Pachá' Vargas Sepúlveda updated ARROW-12321:
------------------------------------------------------
Description:
_Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
Please see
https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
where I added the raw data and the output.
This works:
{code:java}
library(data.table)
library(dplyr)
library(arrow)
d <- fread(
input =
"01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
colClasses = list(
character = "Commodity Code",
numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
))
d <- d %>%
mutate(
`Reporter ISO` = case_when(
`Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
TRUE ~ `Reporter ISO`
),
`Partner ISO` = case_when(
`Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
TRUE ~ `Partner ISO`
)
)
# d %>%
# select(Year, `Reporter ISO`, `Partner ISO`) %>%
# distinct() %>%
# dim()
d %>%
group_by(Year, `Reporter ISO`) %>%
write_dataset("parquet", hive_style = F, max_partitions = 1024L)
{code}
But if I add an additional column for partitioning and increase the max
partitions to 12808 (exactly the number of partitions that it needs), I
get this error:
{code:java}
d %>%
group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
write_dataset("parquet", hive_style = F, max_partitions = 12808)
Error: IOError: Failed to open local file
'/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
Detail: [errno 24] Too many open files
{code}
> [R][C++] Arrow opens too many files at once when writing a dataset
> ------------------------------------------------------------------
>
> Key: ARROW-12321
> URL: https://issues.apache.org/jira/browse/ARROW-12321
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 3.0.0
> Reporter: Mauricio 'Pachá' Vargas Sepúlveda
> Assignee: Weston Pace
> Priority: Major
> Fix For: 5.0.0
>