[ 
https://issues.apache.org/jira/browse/ARROW-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190369#comment-17190369
 ] 

Neal Richardson commented on ARROW-9903:
----------------------------------------

Ok, so {{open_dataset()}} itself isn't hanging, but after querying/scanning the 
dataset some number of times, the query stops responding. I'm not sure why that 
is, and these problems are difficult to debug, especially on Windows. 

It looks like you're essentially trying to partition the dataset into separate 
chunks by {{id_col}} and do work on those separately. The new, not-yet-released 
{{write_dataset()}} function lets you write a dataset with files partitioned 
however you want, so that would simplify your dataset queries and could work 
around whatever issue you're hitting here. 

If you're interested in trying it out, install a nightly dev package with

{code}
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
{code}

and see 
https://ursalabs.org/arrow-r-nightly/articles/dataset.html#writing-datasets for 
examples of how to use it.  
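
As a rough sketch of the idea (not run against the data in this report; the example data frame, its values, and the output path are made up for illustration, and only {{id_col}} comes from the discussion above), partitioning on write and then reading one chunk at a time could look like this, assuming a build of arrow that includes {{write_dataset()}}:

{code}
library(arrow)
library(dplyr)

# Hypothetical example data; replace with your own table or dataset
df <- data.frame(
  id_col = rep(c("a", "b", "c"), each = 4),
  value = 1:12
)

# Write one directory of feather files per distinct value of id_col
out_dir <- file.path(tempdir(), "partitioned_ds")
write_dataset(df, out_dir, format = "feather", partitioning = "id_col")

# Reopen the partitioned dataset and pull a single chunk into memory
ds <- open_dataset(out_dir, format = "feather")
chunk_a <- ds %>%
  filter(id_col == "a") %>%
  collect()
{code}

With the files laid out this way, a filter on {{id_col}} only needs to touch the files in the matching partition directory, so each chunk can be processed without scanning the whole dataset.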

> [R] open_dataset freezes opening feather files
> ----------------------------------------------
>
>                 Key: ARROW-9903
>                 URL: https://issues.apache.org/jira/browse/ARROW-9903
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>         Environment: Rstudio
>            Reporter: Sean Clement
>            Priority: Major
>
> Session info:
> {code:java}
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.1     purrr_0.3.4     readr_1.3.1     tidyr_1.1.1
>  [7] tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0 arrow_1.0.1
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2   dbplyr_1.4.4     tools_4.0.2
>  [7] bit_1.1-15.2     lubridate_1.7.9  jsonlite_1.7.0   lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3
> [13] rlang_0.4.7      reprex_0.3.0     cli_2.0.2        DBI_1.1.0        rstudioapi_0.11  haven_2.3.1
> [19] withr_2.2.0      xml2_1.3.2       httr_1.4.2       fs_1.4.1         generics_0.0.2   vctrs_0.3.2
> [25] hms_0.5.3        bit64_0.9-7      grid_4.0.2       tidyselect_1.1.0 glue_1.4.1       R6_2.4.1
> [31] fansi_0.4.1      readxl_1.3.1     modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.7
> [37] scales_1.1.1     ellipsis_0.3.1   rvest_0.3.5      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6
> [43] munsell_0.5.0    broom_0.7.0      crayon_1.3.4
> {code}
> While cycling through and processing files using open_dataset(..., format = 
> "feather") in R, the function hangs randomly and will not proceed to the next 
> file. The freeze does not occur at the same file each time. Additionally, 
> the same function on occasion freezes when used just once. 
> When open_dataset hangs, the only way to get R to stop is through Task 
> Manager, as RStudio becomes totally unresponsive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
