paleolimbot commented on issue #37302:
URL: https://github.com/apache/arrow/issues/37302#issuecomment-1688103216

   I think this is a peculiarity of our `dim.arrow_dplyr_query()` 
implementation, which uses `Scanner$CountRows()`. For example, a regular 
`collect()` works even though `dim()` doesn't:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   library(dplyr, warn.conflicts = FALSE)
   
   tf <- tempfile()
   dir.create(tf)
   write_dataset(group_by(mtcars, am), tf)
   
   # fine?
   open_dataset(tf) |>
     filter(cyl == 6) %>%
     to_duckdb() %>%
     mutate(mean_hp = mean(hp)) %>%
     to_arrow() %>%
     filter(hp < mean_hp) %>%
     collect()
   #> Warning: Missing values are always removed in SQL aggregation functions.
   #> Use `na.rm = TRUE` to silence this warning
   #> This warning is displayed once every 8 hours.
   #> # A tibble: 4 × 12
   #>     mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb    am mean_hp
   #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>
   #> 1  21.4     6   258   110  3.08  3.22  19.4     1     3     1     0    122.
   #> 2  18.1     6   225   105  2.76  3.46  20.2     1     3     1     0    122.
   #> 3  21       6   160   110  3.9   2.62  16.5     0     4     4     1    122.
   #> 4  21       6   160   110  3.9   2.88  17.0     0     4     4     1    122.
   
   # fine
   open_dataset(tf) |>
     filter(cyl == 6) %>%
     to_duckdb() %>%
     mutate(mean_hp = mean(hp)) %>%
     to_arrow() %>%
     filter(hp < mean_hp) %>%
     collect()
   #> # A tibble: 4 × 12
   #>     mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb    am mean_hp
   #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>
   #> 1  21.4     6   258   110  3.08  3.22  19.4     1     3     1     0    122.
   #> 2  18.1     6   225   105  2.76  3.46  20.2     1     3     1     0    122.
   #> 3  21       6   160   110  3.9   2.62  16.5     0     4     4     1    122.
   #> 4  21       6   160   110  3.9   2.88  17.0     0     4     4     1    122.
   
   # error
   open_dataset(tf) |>
     filter(cyl == 6) %>%
     to_duckdb() %>%
     mutate(mean_hp = mean(hp)) %>%
     to_arrow() %>%
     filter(hp < mean_hp) %>%
     dim()
   #> Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
   ```
   
   The error traceback:
   
   ```
   Error: NotImplemented: Call to R (SafeRecordBatchReader::ReadNext()) from a non-R thread from an unsupported context
   dataset___Scanner__CountRows(self) at dataset-scan.R#85
   Scanner$create(x)$CountRows() at dplyr.R#186
   dim.arrow_dplyr_query(x)
   dim(x)
   nrow(.)
   ```
   
   A workaround is to use `count()` followed by `pull(n)`. This works because 
executing an exec plan is one of the "supported contexts" for `SafeCallIntoR()`. 
Calling a `Scanner` method is not, and probably shouldn't be, since as far as I 
know all of the Scanner methods can be implemented using an exec plan.
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   library(dplyr, warn.conflicts = FALSE)
   
   tf <- tempfile()
   dir.create(tf)
   write_dataset(group_by(mtcars, am), tf)
   
   open_dataset(tf) |>
     filter(cyl == 6) %>%
     to_duckdb() %>%
     mutate(mean_hp = mean(hp)) %>%
     to_arrow() %>%
     filter(hp < mean_hp) %>%
     count() |> 
     pull(n)
   #> Warning: Missing values are always removed in SQL aggregation functions.
   #> Use `na.rm = TRUE` to silence this warning
   #> This warning is displayed once every 8 hours.
   #> Warning: Default behavior of `pull()` on Arrow data is changing. Current behavior of returning an R vector is deprecated, and in a future release, it will return an Arrow `ChunkedArray`. To control this:
   #> ℹ Specify `as_vector = TRUE` (the current default) or `FALSE` (what it will change to) in `pull()`
   #> ℹ Or, set `options(arrow.pull_as_vector)` globally
   #> This warning is displayed once every 8 hours.
   #> [1] 4
   ```
   
   A more permanent solution would be to reimplement `dim()` for a dplyr query 
using an exec plan.
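   
   Until something like that exists, the same idea can be wrapped in a small 
user-side helper. This is only a rough sketch: `count_rows_via_plan()` is a 
hypothetical name, not something the arrow package provides, and it assumes the 
query supports `ungroup()`, `summarise()` and `collect()`. It collects the 
one-row result first, so the final `pull()` runs on a plain tibble and should 
not hit the `pull()` deprecation warning shown above.
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   # Hypothetical helper (not part of arrow): count the rows of a lazy query
   # by running an aggregation through collect() instead of calling
   # Scanner$CountRows() the way dim() currently does.
   count_rows_via_plan <- function(query) {
     query |>
       ungroup() |>          # make sure we get one total, not one per group
       summarise(n = n()) |> # the count is computed as part of the query
       collect() |>          # materialise the one-row result as a tibble
       pull(n)               # ordinary dplyr pull() on a tibble
   }
   
   # Usage on an in-memory Arrow table; a query built with
   # to_duckdb() / to_arrow() would be passed in the same way.
   count_rows_via_plan(filter(arrow_table(mtcars), cyl == 6))
   ```
   
   I have only thought about this for row counts; the number of columns is 
already known from the query without evaluating it.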


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
