[
https://issues.apache.org/jira/browse/ARROW-10773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241950#comment-17241950
]
Bruno Tremblay commented on ARROW-10773:
----------------------------------------
Here is the fun part.
The Table is built from a single RAWSXP vector representing an IPC stream.
When this raw vector is saved to disk using saveRDS and then reread using
readRDS, the resulting Table has no problem being converted to a data.frame,
even with multithreading.
The problem occurs only when the vector stays in memory and multithreading is
enabled.
Mind you, building the Table itself is not an issue, and querying the table for
every row also yields the expected results.
It's pretty hard for me to nail the problem down, as I do not yet have any
notion of how threads are handled in C++.
But I'm pretty sure it has to do with either memory management or the
length/capacity of the in memory vector.
Next up is doing a memory dump to compare the in-memory-only method with the
memory-disk-memory method.
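For reference, the two code paths being compared can be sketched roughly as
below. This is only an illustration under assumptions: the raw vector here is
produced with arrow::write_to_raw() from a small stand-in data.frame, whereas
the real one comes from the BigQuery Storage API, and the variable names are
made up. Such a small input may well not reproduce the hang; the point is just
to show where the two methods differ.

```r
library(arrow)

# Stand-in for the raw vector (RAWSXP) holding an IPC stream.
# Assumption: in the real case this vector comes from the BigQuery Storage API.
raw_stream <- write_to_raw(
  data.frame(x = 1:10, y = letters[1:10]),
  format = "stream"
)

options(arrow.use_threads = TRUE)

# Method 1: the raw vector stays in memory (the case that hangs on Windows).
tbl_mem <- read_ipc_stream(raw_stream, as_data_frame = FALSE)

# Method 2: memory-disk-memory, round-tripping the raw vector through an RDS file.
rds_path <- tempfile(fileext = ".rds")
saveRDS(raw_stream, rds_path)
raw_disk <- readRDS(rds_path)
tbl_disk <- read_ipc_stream(raw_disk, as_data_frame = FALSE)

# The bytes should be identical; only the backing memory differs.
same_bytes <- identical(raw_stream, raw_disk)

# Converting to data.frame is the step where the in-memory case gets stuck.
df_disk <- as.data.frame(tbl_disk)
df_mem <- as.data.frame(tbl_mem)
```

If same_bytes is TRUE while only the first method hangs, that would point at how
the in-memory buffer is held or shared across threads rather than at the
stream contents.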
> [R] parallel as.data.frame.Table hangs indefinitely on Windows
> --------------------------------------------------------------
>
> Key: ARROW-10773
> URL: https://issues.apache.org/jira/browse/ARROW-10773
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 2.0.0
> Reporter: Bruno Tremblay
> Priority: Minor
>
> On Windows only
> Tested on 2 machines, mingw.
> Reprex
> {code:java}
> install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
> remotes::install_github("meztez/bigrquery", ref = "bigquerystorage",
> INSTALL_opts = "--no-multiarch")
> library(bigrquery)
> Sys.info()
> sessionInfo()
> Sys.setenv("BIGQUERY_TEST_PROJECT"="{project}")
> con <- bigrquery::dbConnect(
> bigrquery::bigquery(),
> project = "bigquery-public-data",
> dataset = "usa_names",
> billing = bigrquery:::bq_test_project())
> # Does not hang
> options(arrow.use_threads = FALSE)
> dt <- DBI::dbReadTable(con,
> "bigquery-public-data.usa_names.usa_1910_current", bqs = TRUE)
> # Hangs
> options(arrow.use_threads = TRUE)
> dt <- DBI::dbReadTable(con,
> "bigquery-public-data.usa_names.usa_1910_current", bqs = TRUE){code}
>
> Session details
>
> {code:java}
> > Sys.info()
> sysname release version nodename machine
> login user effective_user
> "Windows" "10 x64" "build 19042" "C000055787" "x86-64"
> "gen01914" "gen01914" "gen01914"
> > sessionInfo()
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19042)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] bigrquery_1.3.2.9001
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5 rstudioapi_0.13 magrittr_1.5
> tidyselect_1.1.0 bit_4.0.4 R6_2.5.0
> [7] rlang_0.4.8 dplyr_1.0.2 httr_1.4.2
> tools_4.0.3 arrow_2.0.0.20201130 DBI_1.1.0
> [13] dbplyr_2.0.0 ellipsis_0.3.1 remotes_2.2.0
> bit64_4.0.5 assertthat_0.2.1 gargle_0.5.0
> [19] tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4
> purrr_0.3.4 fs_1.5.0 vctrs_0.3.4
> [25] glue_1.4.2 compiler_4.0.3 pillar_1.4.6
> generics_0.1.0 jsonlite_1.7.1 pkgconfig_2.0.3
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)