Re: [I] [R] slice_sample does not sample randomly [arrow]

via GitHub Mon, 03 Feb 2025 08:16:15 -0800


twedl commented on issue #45139:
URL: https://github.com/apache/arrow/issues/45139#issuecomment-2631452887


   To add some context and detail to this bug to help us along:
   
   ``` r
   library(tidyverse)
   library(arrow)
   
   tbl <- tibble(int = 1:2)
   tab <- arrow_table(tbl)
   
   # dplyr version of slice_sample:
   dplyr_res <- replicate(
       1000, 
       tbl |> slice_sample(n=1), 
       simplify = FALSE
     ) |> 
     list_rbind()
   
   # arrow version:
   arrow_res <- replicate(
       1000, 
       tab |> slice_sample(n = 1) |> collect(), 
       simplify = FALSE
     ) |>
     list_rbind()
   
   # expect close to 0.5 for each 
   table(dplyr_res$int) / nrow(dplyr_res)
   #> 
   #>     1     2 
   #> 0.522 0.478
   table(arrow_res$int) / nrow(arrow_res)
   #> 
   #>         1         2 
   #> 0.6902107 0.3097893
   
   # and 1000 results
   nrow(dplyr_res)
   #> [1] 1000
   nrow(arrow_res)
   #> [1] 807
   ```
   
   <sup>Created on 2025-02-03 with [reprex 
v2.1.1](https://reprex.tidyverse.org)</sup>
   
   <details style="margin-bottom:10px;">
   
   <summary>
   
   Session info
   </summary>
   
   ``` r
   sessionInfo()
   #> R version 4.4.2 (2024-10-31)
   #> Platform: aarch64-apple-darwin20
   #> Running under: macOS Sequoia 15.1.1
   #> 
   #> Matrix products: default
   #> BLAS:   
/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
 
   #> LAPACK: 
/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;
  LAPACK version 3.12.0
   #> 
   #> locale:
   #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
   #> 
   #> time zone: America/Halifax
   #> tzcode source: internal
   #> 
   #> attached base packages:
   #> [1] stats     graphics  grDevices utils     datasets  methods   base     
   #> 
   #> other attached packages:
   #>  [1] arrow_19.0.0.9000 lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1  
  
   #>  [5] dplyr_1.1.4       purrr_1.0.2       readr_2.1.5       tidyr_1.3.1    
  
   #>  [9] tibble_3.2.1      ggplot2_3.5.1     tidyverse_2.0.0  
   #> 
   #> loaded via a namespace (and not attached):
   #>  [1] bit_4.5.0         gtable_0.3.6      compiler_4.4.2    reprex_2.1.1   
  
   #>  [5] tidyselect_1.2.1  assertthat_0.2.1  scales_1.3.0      yaml_2.3.10    
  
   #>  [9] fastmap_1.2.0     R6_2.5.1          generics_0.1.3    knitr_1.49     
  
   #> [13] munsell_0.5.1     pillar_1.9.0      tzdb_0.4.0        rlang_1.1.4    
  
   #> [17] utf8_1.2.4        stringi_1.8.4     xfun_0.49         fs_1.6.5       
  
   #> [21] bit64_4.5.2       timechange_0.3.0  cli_3.6.3         withr_3.0.2    
  
   #> [25] magrittr_2.0.3    digest_0.6.37     grid_4.4.2        
rstudioapi_0.17.1
   #> [29] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.1 
  
   #> [33] glue_1.8.0        fansi_1.0.6       colorspace_2.1-1  rmarkdown_2.29 
  
   #> [37] tools_4.4.2       pkgconfig_2.0.3   htmltools_0.5.8.1
   ```
   
   </details>
   
   It boils down to these two lines in `slice_sample.arrow_dplyr_query`:
   
   
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L104
   
   And:
   
   
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L132
   
   The current strategy is to convert `n` to `prop`, get answer via the `prop` 
version, then convert back to `n` by applying `head`, which will end up picking 
the first `n` rows that pass the prop filter. In @blongworth's example, it'll 
pick the first 10 out of the ~260 random rows selected out of 5000 (the 260 
comes from L104 above: `10 + 0.05 * 5000`). On top of that, the prop filter 
sometimes won't select enough rows, resulting in some empty samples (`n<1`, in 
my example).
   
   Swapping `head` to `sample` in L132 above should fix the reported bug for 
now. But someone might want to improve the sampling methodology altogether; 
e.g., referenced by the comment: 
   
   
https://github.com/apache/arrow/blob/ddc4bdeb1b5bbb2511aecd97b5aa8ffc6c092d7c/r/R/dplyr-slice.R#L110
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] [R] slice_sample does not sample randomly [arrow]

Reply via email to