[
https://issues.apache.org/jira/browse/ARROW-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190369#comment-17190369
]
Neal Richardson commented on ARROW-9903:
----------------------------------------
Ok, so {{open_dataset()}} itself isn't hanging, but after querying/scanning the
dataset some number of times, the query stops responding. I'm not sure why that
is, and these problems are difficult to debug, especially on Windows.
It looks like you're essentially trying to partition the dataset into separate
chunks by {{id_col}} and do work on those separately. The new, not-yet-released
{{write_dataset()}} function lets you write a dataset with files partitioned
however you want, so that would simplify your dataset queries and could work
around whatever issue you're hitting here.
If you're interested in trying it out, install a nightly dev package with
{code}
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
{code}
and see
https://ursalabs.org/arrow-r-nightly/articles/dataset.html#writing-datasets for
examples of how to use it.
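
As a rough sketch of that approach (not a tested fix for this issue): the toy data, the {{value}} column and the temporary directories below are made up for illustration, and only {{id_col}} comes from your description. The argument names follow the {{write_dataset()}} interface in released versions of arrow, so the nightly build may differ slightly.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the original feather files.
src_dir <- tempfile("feather_src")
dir.create(src_dir)
toy <- data.frame(id_col = rep(1:3, each = 4), value = seq_len(12))
write_feather(toy, file.path(src_dir, "part-0.feather"))

# Open the existing files as one dataset.
ds <- open_dataset(src_dir, format = "feather")

# Rewrite the dataset so each id_col value gets its own directory
# (hive-style, e.g. id_col=1/, id_col=2/, ...).
out_dir <- tempfile("feather_by_id")
write_dataset(ds, out_dir, format = "feather", partitioning = "id_col")

# A filter on id_col now only needs to read the matching partition.
by_id <- open_dataset(out_dir, format = "feather")
chunk <- by_id %>%
  filter(id_col == 2) %>%
  collect()
```

Once the data is partitioned this way, each per-{{id_col}} step becomes a single filtered {{collect()}} rather than a repeated scan over every file.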
> [R] open_dataset freezes opening feather files
> ----------------------------------------------
>
> Key: ARROW-9903
> URL: https://issues.apache.org/jira/browse/ARROW-9903
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Environment: Rstudio
> Reporter: Sean Clement
> Priority: Major
>
> Session info:
> {code:java}
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4
> readr_1.3.1 tidyr_1.1.1
> [7] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 arrow_1.0.1
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 compiler_4.0.2
> dbplyr_1.4.4 tools_4.0.2
> [7] bit_1.1-15.2 lubridate_1.7.9 jsonlite_1.7.0 lifecycle_0.2.0
> gtable_0.3.0 pkgconfig_2.0.3
> [13] rlang_0.4.7 reprex_0.3.0 cli_2.0.2 DBI_1.1.0
> rstudioapi_0.11 haven_2.3.1
> [19] withr_2.2.0 xml2_1.3.2 httr_1.4.2 fs_1.4.1
> generics_0.0.2 vctrs_0.3.2
> [25] hms_0.5.3 bit64_0.9-7 grid_4.0.2 tidyselect_1.1.0
> glue_1.4.1 R6_2.4.1
> [31] fansi_0.4.1 readxl_1.3.1 modelr_0.1.8 blob_1.2.1
> magrittr_1.5 backports_1.1.7
> [37] scales_1.1.1 ellipsis_0.3.1 rvest_0.3.5 assertthat_0.2.1
> colorspace_1.4-1 stringi_1.4.6
> [43] munsell_0.5.0 broom_0.7.0 crayon_1.3.4
> {code}
> While cycling through and processing files using open_dataset(..., format =
> "feather") in R, the function hangs randomly and will not proceed to the next
> file. The freeze does not occur at the same file each time. Additionally,
> the same function occasionally freezes even when it is called only once.
> When open_dataset hangs, the only way to stop R is through Task Manager,
> as RStudio becomes totally unresponsive.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)