[ 
https://issues.apache.org/jira/browse/ARROW-10773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241950#comment-17241950
 ] 

Bruno Tremblay commented on ARROW-10773:
----------------------------------------

Here is the fun part.

The Table is built from a single vector of RAWSXP representing an IPC stream.

When this raw vector is saved to disk using saveRDS and then reread using 
readRDS, the resulting Table has no problem being converted to a data.frame, 
even with multithreading.

The problem only occurs with multithreading when the vector stays in memory.

Mind you, building the Table itself is not an issue, and querying the table 
for every row also yields the expected results.
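
A minimal sketch of the comparison described above, for anyone who wants to 
try the two paths side by side. The small data.frame and the names 
raw_stream, raw_disk, tbl_mem and tbl_disk are hypothetical stand-ins for the 
BigQuery Storage payload; toy data like this is not expected to reproduce the 
hang by itself.

```r
library(arrow)

# Hypothetical stand-in for the real payload: a raw vector (RAWSXP)
# holding an Arrow IPC stream.
df <- data.frame(name = c("Mary", "Anna"), number = c(100L, 200L))
raw_stream <- write_to_raw(df, format = "stream")

# Path 1: the raw vector stays in memory.
tbl_mem <- read_ipc_stream(raw_stream, as_data_frame = FALSE)

# Path 2: the raw vector round-trips through disk.
rds_path <- tempfile(fileext = ".rds")
saveRDS(raw_stream, rds_path)
raw_disk <- readRDS(rds_path)
tbl_disk <- read_ipc_stream(raw_disk, as_data_frame = FALSE)

# The bytes are the same on both paths; only the backing memory differs.
stopifnot(identical(raw_stream, raw_disk))

# Convert both Tables with multithreading enabled.
options(arrow.use_threads = TRUE)
df_disk <- as.data.frame(tbl_disk)
df_mem <- as.data.frame(tbl_mem)
```

If the bytes are identical on both paths, any difference in behaviour would 
have to come from how the vector is held in memory rather than from the 
stream contents.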

It's pretty hard for me to nail the problem down, as I do not yet have any 
notion of how threads are handled in C++.

But I'm pretty sure it has to do with either memory management or the 
length/capacity of the in-memory vector.

Next up is doing a memory dump to compare the in-memory-only method with the 
memory-disk-memory method.

> [R] parallel as.data.frame.Table hangs indefinitely on Windows
> --------------------------------------------------------------
>
>                 Key: ARROW-10773
>                 URL: https://issues.apache.org/jira/browse/ARROW-10773
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 2.0.0
>            Reporter: Bruno Tremblay
>            Priority: Minor
>
> On Windows only
> Tested on 2 machines, mingw. 
> Reprex
> {code:java}
> install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
> remotes::install_github("meztez/bigrquery", ref = "bigquerystorage",
>                         INSTALL_opts = "--no-multiarch")
> library(bigrquery)
> Sys.info()
> sessionInfo()
> Sys.setenv("BIGQUERY_TEST_PROJECT" = "{project}")
> con <- bigrquery::dbConnect(
>   bigrquery::bigquery(),
>   project = "bigquery-public-data",
>   dataset = "usa_names",
>   billing = bigrquery:::bq_test_project())
> # Does not hang
> options(arrow.use_threads = FALSE)
> dt <- DBI::dbReadTable(con,
>   "bigquery-public-data.usa_names.usa_1910_current", bqs = TRUE)
> # Hangs
> options(arrow.use_threads = TRUE)
> dt <- DBI::dbReadTable(con,
>   "bigquery-public-data.usa_names.usa_1910_current", bqs = TRUE)
> {code}
>  
> Session details
>  
> {code:java}
> > Sys.info()
>        sysname        release        version       nodename        machine
>      "Windows"       "10 x64"  "build 19042"   "C000055787"       "x86-64"
>          login           user effective_user
>     "gen01914"     "gen01914"     "gen01914"
> > sessionInfo()
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19042)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] bigrquery_1.3.2.9001
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5           rstudioapi_0.13      magrittr_1.5         tidyselect_1.1.0     bit_4.0.4            R6_2.5.0
>  [7] rlang_0.4.8          dplyr_1.0.2          httr_1.4.2           tools_4.0.3          arrow_2.0.0.20201130 DBI_1.1.0
> [13] dbplyr_2.0.0         ellipsis_0.3.1       remotes_2.2.0        bit64_4.0.5          assertthat_0.2.1     gargle_0.5.0
> [19] tibble_3.0.4         lifecycle_0.2.0      crayon_1.3.4         purrr_0.3.4          fs_1.5.0             vctrs_0.3.4
> [25] glue_1.4.2           compiler_4.0.3       pillar_1.4.6         generics_0.1.0       jsonlite_1.7.1       pkgconfig_2.0.3
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
