[jira] [Updated] (ARROW-12321) [R][C++] Arrow opens too many files at once when writing a dataset

Jira Fri, 09 Apr 2021 15:46:04 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mauricio 'Pachá' Vargas Sepúlveda updated ARROW-12321:
------------------------------------------------------
    Description: 
_Related to:_ https://issues.apache.org/jira/browse/ARROW-12315

Please see 
https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
 where I added the raw data and the output.

This works:

{code:java}

library(data.table)
library(dplyr)
library(arrow)

d <- fread(
        input = 
"01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
        colClasses = list(
          character = "Commodity Code",
          numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
        ))

d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )

# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()

d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
{code}

But if I add an additional column for partitioning and increase the max 
partitions to 12808 (to pass exactly the number of partitions that it needs), I 
get the error:

{code:java}
d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)

Error: IOError: Failed to open local file 
'/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
 Detail: [errno 24] Too many open files
{code}
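For scale, here is a rough sketch of how the partition count compares with the per-process open-file limit. It assumes the writer keeps one file open per partition, which the error suggests but which is not confirmed here. The toy data is made up, the helper names count_partitions and exceeds_limit are hypothetical, and 1024 is only an assumed limit (a common default soft limit on Linux; check the real value with `ulimit -n` in the shell that starts R).

```r
# Toy stand-in for the trade data: column names follow the issue,
# the values are invented for illustration.
d <- data.frame(
  Year = 2019L,
  `Reporter ISO` = c("SEN", "SEN", "CHL", "CHL", "CHL"),
  `Partner ISO`  = c("MOZ", "FRA", "MOZ", "MOZ", "PER"),
  check.names = FALSE
)

# Hypothetical helper: number of distinct partition keys, i.e. the number
# of output directories (and at least that many files) a write would create.
count_partitions <- function(df, cols) {
  nrow(unique(df[, cols, drop = FALSE]))
}

n_two   <- count_partitions(d, c("Year", "Reporter ISO"))
n_three <- count_partitions(d, c("Year", "Reporter ISO", "Partner ISO"))

# Assumed limit, not read from the system.
open_file_limit <- 1024L

# Hypothetical helper: TRUE when one open file per partition would
# exceed the limit.
exceeds_limit <- function(n_partitions, limit) {
  n_partitions > limit
}

exceeds_limit(12808L, open_file_limit)
```

With the real data, the same count over Year, `Reporter ISO` and `Partner ISO` is what the commented-out distinct() call above computes, and 12808 partitions is well above the assumed limit of 1024.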





  was:
_Related to https://issues.apache.org/jira/browse/ARROW-12315_

Please see 
https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
 where I added the raw data and the output.

This works:

{code:java}

library(data.table)
library(dplyr)
library(arrow)

d <- fread(
        input = 
"01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
        colClasses = list(
          character = "Commodity Code",
          numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
        ))

d <- d %>%
  mutate(
    `Reporter ISO` = case_when(
      `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Reporter ISO`
    ),
    `Partner ISO` = case_when(
      `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
      TRUE ~ `Partner ISO`
    )
  )

# d %>%
#   select(Year, `Reporter ISO`, `Partner ISO`) %>%
#   distinct() %>%
#   dim()

d %>%
  group_by(Year, `Reporter ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
{code}

But if I add an additional column for partitioning and increase the max 
partitions to 12808 (to pass exactly the number of partitions that it needs), I 
get the error:

{code:java}
d %>%
  group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
  write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)

Error: IOError: Failed to open local file 
'/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
 Detail: [errno 24] Too many open files
{code}






> [R][C++] Arrow opens too many files at once when writing a dataset
> ------------------------------------------------------------------
>
>                 Key: ARROW-12321
>                 URL: https://issues.apache.org/jira/browse/ARROW-12321
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 3.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> _Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
> Please see 
> https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
>  where I added the raw data and the output.
> This works:
> {code:java}
> library(data.table)
> library(dplyr)
> library(arrow)
> d <- fread(
>         input = 
> "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
>         colClasses = list(
>           character = "Commodity Code",
>           numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
>         ))
> d <- d %>%
>   mutate(
>     `Reporter ISO` = case_when(
>       `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Reporter ISO`
>     ),
>     `Partner ISO` = case_when(
>       `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Partner ISO`
>     )
>   )
> # d %>%
> #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
> #   distinct() %>%
> #   dim()
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
> {code}
> But if I add an additional column for partitioning and increase the max 
> partitions to 12808 (to pass exactly the number of partitions that it needs), 
> I get the error:
> {code:java}
> d %>%
>   group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 12808)
> Error: IOError: Failed to open local file 
> '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
>  Detail: [errno 24] Too many open files
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
