[
https://issues.apache.org/jira/browse/ARROW-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405506#comment-17405506
]
Carl Boettiger commented on ARROW-13761:
----------------------------------------
Thanks all for the explanations. I can confirm that if I run a query on the
taxi data that returns a non-empty result, I do not get the crash.
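For reference, a minimal sketch of what such a non-empty query could look like,
assuming the same nyc-taxi directory as in the issue description with only 2009
and 2010 downloaded. The choice of year == 2009 is an assumption for
illustration, not necessarily the exact query that was run:

```r
library(arrow)
library(dplyr)
# Same dataset layout as in the issue description
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
# Assumption: 2009 is one of the downloaded years, so the filter can match rows,
# unlike year == 2015, which matches nothing when only 2009 and 2010 are present
ds %>%
  filter(total_amount > 100, year == 2009) %>%
  collect()
```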
As noted above, though, when I attempt the same simple filters on my own local
parquet file (which I originally created using arrow), arrow quickly consumes
over 40 GB of RAM and crashes the R client. Should I open a separate issue for
that one?
I think this should reproduce it. Apologies for the ~150 GB example file; I
haven't figured out how to reproduce this with smaller data (which naturally
don't trigger the OOM).
{code:r}
library(arrow)
library(dplyr)
# Download the example parquet file (a large download)
file <- "part-0.parquet"
download.file(
  "https://minio.cirrus.carlboettiger.info/shared-data/birddb/parquet/part-0.parquet",
  file
)
# Open the file lazily as a dataset, then filter and materialize the result
ds <- open_dataset(file, format = "parquet")
ds %>%
  filter(COUNTRY == "Mexico", `COMMON NAME` == "Wood thrush") %>%
  compute()
{code}
> [R] arrow::filter() crashes (aborts R session)
> ----------------------------------------------
>
> Key: ARROW-13761
> URL: https://issues.apache.org/jira/browse/ARROW-13761
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 5.0.0
> Reporter: Carl Boettiger
> Priority: Major
>
> Arrow crashes (aborts R session) when attempting to evaluate `filter` with a
> `collect()` command, e.g. following arrow's dplyr vignette:
> https://cran.r-project.org/web/packages/arrow/vignettes/dataset.html
> ```r
> library(arrow)
> library(dplyr)
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> x <- ds %>%
> filter(total_amount > 100, year == 2015)
> x %>% collect() # crashes R
> ```
> (Note: for simplicity, I downloaded only years 2009 and 2010 using the R loop
> you provide in the vignette.)
> I observe this behavior in an RStudio Server instance on an Ubuntu 20.04 Linux
> server with 128 cores and 256 GB RAM.
> Here's my sessionInfo():
> ```r
> sessionInfo()
> R version 4.1.0 (2021-05-18)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.2 LTS
> Matrix products: default
> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] dplyr_1.0.7 arrow_5.0.0
> loaded via a namespace (and not attached):
> [1] fansi_0.5.0 crayon_1.4.1 utf8_1.2.2 assertthat_0.2.1
> [5] R6_2.5.1 DBI_1.1.1 lifecycle_1.0.0 magrittr_2.0.1
> [9] pillar_1.6.2 rlang_0.4.11 vctrs_0.3.8 generics_0.1.0
> [13] ellipsis_0.3.2 tools_4.1.0 bit64_4.0.5 glue_1.4.2
> [17] purrr_0.3.4 bit_4.0.4 compiler_4.1.0 pkgconfig_2.0.3
> [21] tidyselect_1.1.1 tibble_3.1.3
> ```
--
This message was sent by Atlassian Jira
(v8.3.4#803005)