[ https://issues.apache.org/jira/browse/ARROW-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190353#comment-17190353 ]
Sean Clement commented on ARROW-9903:
-------------------------------------
I know it's freezing on different files because I was having it produce a batch
report.
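{code:java}
# Example
library(arrow)
library(dplyr)

# Open the directory of Feather files as a single dataset
ds <- open_dataset("F:/Test/Feather Files/", format = "feather")

# Collect the distinct ID values to iterate over
id_keys <- ds %>%
  select(id_col) %>%
  collect() %>%
  unique()

for (i in seq_len(nrow(id_keys))) {
  # Pull only the rows for the current ID into memory
  output <- ds %>%
    filter(id_col == id_keys$id_col[i]) %>%
    collect()
  # report processing here
  data.table::fwrite(output, paste0("output_file_", id_keys$id_col[i]))
}
{code}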
Each time I run this batch, the last "output_file_" written before the freeze
is a different one. The files process correctly, and when open_dataset doesn't
freeze they process extremely quickly. But open_dataset hangs unpredictably.
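Since the session has to be killed when it hangs, one way to record where each run stops is to log every key to disk before and after it is processed. The following is only a minimal sketch in base R: run_batch_with_log(), unfinished_keys() and the log file name are hypothetical helpers made up for illustration, not part of arrow, and the sketch makes no claim about what causes the hang.

```r
# Hypothetical helpers (not part of arrow). Each cat() call opens, appends to,
# and closes the log file, so the entries written so far are already on disk
# if the R session later has to be killed.
run_batch_with_log <- function(keys, process_one,
                               log_path = "batch_progress.log") {
  file.create(log_path)  # start each run with an empty log
  for (i in seq_along(keys)) {
    key <- as.character(keys[i])
    stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
    cat(sprintf("%s START %s\n", stamp, key), file = log_path, append = TRUE)
    process_one(keys[i])
    stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
    cat(sprintf("%s DONE %s\n", stamp, key), file = log_path, append = TRUE)
  }
  invisible(log_path)
}

# Keys that have a START entry but no matching DONE entry in the log
unfinished_keys <- function(log_path) {
  lines <- readLines(log_path)
  started <- sub("^.* START ", "", grep(" START ", lines, value = TRUE))
  done <- sub("^.* DONE ", "", grep(" DONE ", lines, value = TRUE))
  setdiff(started, done)
}
```

With the example above, process_one would wrap the filter(), collect() and fwrite() steps for a single id_col value. After a forced restart, unfinished_keys("batch_progress.log") returns the key that was in flight when the session stopped responding.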
> [R] open_dataset freezes opening feather files
> ----------------------------------------------
>
> Key: ARROW-9903
> URL: https://issues.apache.org/jira/browse/ARROW-9903
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Environment: Rstudio
> Reporter: Sean Clement
> Priority: Major
>
> Session info:
> {code:java}
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1
> [7] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 arrow_1.0.1
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 compiler_4.0.2 dbplyr_1.4.4 tools_4.0.2
> [7] bit_1.1-15.2 lubridate_1.7.9 jsonlite_1.7.0 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3
> [13] rlang_0.4.7 reprex_0.3.0 cli_2.0.2 DBI_1.1.0 rstudioapi_0.11 haven_2.3.1
> [19] withr_2.2.0 xml2_1.3.2 httr_1.4.2 fs_1.4.1 generics_0.0.2 vctrs_0.3.2
> [25] hms_0.5.3 bit64_0.9-7 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1 R6_2.4.1
> [31] fansi_0.4.1 readxl_1.3.1 modelr_0.1.8 blob_1.2.1 magrittr_1.5 backports_1.1.7
> [37] scales_1.1.1 ellipsis_0.3.1 rvest_0.3.5 assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6
> [43] munsell_0.5.0 broom_0.7.0 crayon_1.3.4
> {code}
> While cycling through and processing files using open_dataset(..., format =
> "feather") in R, the function hangs randomly and will not proceed to the next
> file. The freeze does not occur at the same file each time. Additionally,
> the same function occasionally freezes even when it is called only once.
> When open_dataset hangs, the only way to stop R is to use Task Manager,
> because RStudio becomes totally unresponsive.