paleolimbot commented on issue #37302:
URL: https://github.com/apache/arrow/issues/37302#issuecomment-1688103216
I think this is a peculiarity of our `dim.arrow_dplyr_query()`
implementation, which uses `Scanner$CountRows()`. For example, a regular
`collect()` works even though `dim()` doesn't:
``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf)
# fine?
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 4 × 12
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb    am mean_hp
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>
#> 1  21.4     6   258   110  3.08  3.22  19.4     1     3     1     0    122.
#> 2  18.1     6   225   105  2.76  3.46  20.2     1     3     1     0    122.
#> 3  21       6   160   110  3.9   2.62  16.5     0     4     4     1    122.
#> 4  21       6   160   110  3.9   2.88  17.0     0     4     4     1    122.
# fine
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
collect()
#> # A tibble: 4 × 12
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb    am mean_hp
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>
#> 1  21.4     6   258   110  3.08  3.22  19.4     1     3     1     0    122.
#> 2  18.1     6   225   105  2.76  3.46  20.2     1     3     1     0    122.
#> 3  21       6   160   110  3.9   2.62  16.5     0     4     4     1    122.
#> 4  21       6   160   110  3.9   2.88  17.0     0     4     4     1    122.
# error
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
dim()
#> Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
```
The error traceback:
```
Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
dataset___Scanner__CountRows(self) at dataset-scan.R#85
Scanner$create(x)$CountRows() at dplyr.R#186
dim.arrow_dplyr_query(x)
dim(x)
nrow(.)
```
The workaround would be to use `count()` and `pull(n)`. This works because
executing an exec plan is one of the "supported contexts" for `SafeCallIntoR()`.
Calling a `Scanner` method is not, and probably shouldn't be, since as far as I
know all of the Scanner methods can be implemented using an exec plan.
``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf)
open_dataset(tf) |>
filter(cyl == 6) %>%
to_duckdb() %>%
mutate(mean_hp = mean(hp)) %>%
to_arrow() %>%
filter(hp < mean_hp) %>%
count() |>
pull(n)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> Warning: Default behavior of `pull()` on Arrow data is changing. Current
#> behavior of returning an R vector is deprecated, and in a future release, it
#> will return an Arrow `ChunkedArray`. To control this:
#> ℹ Specify `as_vector = TRUE` (the current default) or `FALSE` (what it
#>   will change to) in `pull()`
#> ℹ Or, set `options(arrow.pull_as_vector)` globally
#> This warning is displayed once every 8 hours.
#> [1] 4
```
A more permanent solution would be to reimplement `dim()` for a dplyr query
using an exec plan.
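
As a rough illustration of that idea (not the actual arrow implementation), the
row count could come from a `count()` that runs through the exec plan, with the
column count taken from `names()`. The helper name `dim_via_count()` is
hypothetical and is not part of arrow; the sketch assumes `count()`, `ungroup()`
and `names()` behave on an `arrow_dplyr_query` as they do in current releases.

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper, not part of arrow: compute the dimensions of a
# dplyr query without calling Scanner$CountRows().
dim_via_count <- function(query) {
  n_rows <- query |>
    ungroup() |>   # make sure count() returns a single row
    count() |>     # evaluated through the exec plan when collected
    collect() |>   # materialize the one-row result as a tibble
    pull(n)        # plain dplyr pull() on a tibble
  c(as.integer(n_rows), length(names(query)))
}
```

With the dataset from the reprex above, `dim_via_count()` could stand in for
`dim()` at the end of the `to_duckdb()` / `to_arrow()` pipeline.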