twedl commented on issue #45139:
URL: https://github.com/apache/arrow/issues/45139#issuecomment-2631452887
To add some context and detail to this bug to help us along:
``` r
library(tidyverse)
library(arrow)
tbl <- tibble(int = 1:2)
tab <- arrow_table(tbl)
# dplyr version of slice_sample:
dplyr_res <- replicate(
1000,
tbl |> slice_sample(n=1),
simplify = FALSE
) |>
list_rbind()
# arrow version:
arrow_res <- replicate(
1000,
tab |> slice_sample(n = 1) |> collect(),
simplify = FALSE
) |>
list_rbind()
# expect close to 0.5 for each
table(dplyr_res$int) / nrow(dplyr_res)
#>
#> 1 2
#> 0.522 0.478
table(arrow_res$int) / nrow(arrow_res)
#>
#> 1 2
#> 0.6902107 0.3097893
# and 1000 results
nrow(dplyr_res)
#> [1] 1000
nrow(arrow_res)
#> [1] 807
```
<sup>Created on 2025-02-03 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.1.1
#>
#> Matrix products: default
#> BLAS:
/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK:
/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Halifax
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] arrow_19.0.0.9000 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
#> [5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
#> [9] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.5.0 gtable_0.3.6 compiler_4.4.2 reprex_2.1.1
#> [5] tidyselect_1.2.1 assertthat_0.2.1 scales_1.3.0 yaml_2.3.10
#> [9] fastmap_1.2.0 R6_2.5.1 generics_0.1.3 knitr_1.49
#> [13] munsell_0.5.1 pillar_1.9.0 tzdb_0.4.0 rlang_1.1.4
#> [17] utf8_1.2.4 stringi_1.8.4 xfun_0.49 fs_1.6.5
#> [21] bit64_4.5.2 timechange_0.3.0 cli_3.6.3 withr_3.0.2
#> [25] magrittr_2.0.3 digest_0.6.37 grid_4.4.2
rstudioapi_0.17.1
#> [29] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.1
#> [33] glue_1.8.0 fansi_1.0.6 colorspace_2.1-1 rmarkdown_2.29
#> [37] tools_4.4.2 pkgconfig_2.0.3 htmltools_0.5.8.1
```
</details>
It boils down to these two lines in `slice_sample.arrow_dplyr_query`:
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L104
And:
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L132
The current strategy is to convert `n` to `prop`, get answer via the `prop`
version, then convert back to `n` by applying `head`, which will end up picking
the first `n` rows that pass the prop filter. In @blongworth's example, it'll
pick the first 10 out of the ~260 random rows selected out of 5000 (the 260
comes from L104 above: `10 + 0.05 * 5000`). On top of that, the prop filter
sometimes won't select enough rows, resulting in some empty samples (`n<1`, in
my example).
Swapping `head` to `sample` in L132 above should fix the reported bug for
now. But someone might want to improve the sampling methodology altogether;
e.g., referenced by the comment:
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L110
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]