Re: [I] [R] slice_sample returns 0 rows [arrow]

via GitHub Tue, 12 Dec 2023 03:07:29 -0800


thisisnic commented on issue #38638:
URL: https://github.com/apache/arrow/issues/38638#issuecomment-1851821458


   Thanks for the extra information there @lgaborini!
   
   I've looked at this again, and I think it's an unfortunate quirk of the 
original implementation (i.e. a known issue): we had to implement it a little 
differently because the C++ random function doesn't work; see 
https://github.com/apache/arrow/pull/14361#issue-1403214998.
   
   I've tried changing the `min` parameter in the internal UDF: setting it 
higher than the default selects fewer rows, and setting it lower than the 
default selects the right number of rows but with a lot of repetition.
   
   There's [this line](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/r/R/dplyr-slice.R#L132) 
that just takes the first `n` rows of data, which is probably the source of 
the lack of randomness.  I was wondering whether we could call `arrange()` to 
order by the random number and then take the top `n` rows, though I'm not sure 
that will actually work.
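   
   To make the `arrange()` idea concrete, here is a minimal, language-neutral 
sketch in Python rather than R. It is not the arrow implementation: the 
function name `slice_sample_by_random_key` and its `seed` argument are 
hypothetical, and it assumes the data fits in memory. Each row gets a uniform 
random key, the rows are ordered by that key, and the first `n` are kept, so 
the result always has `min(n, number of rows)` distinct rows.

```python
import random


def slice_sample_by_random_key(rows, n, seed=None):
    """Sample up to n rows without replacement by sorting on a random key.

    Hypothetical helper for illustration only; not part of arrow or dplyr.
    """
    rng = random.Random(seed)
    # Attach a uniform random key to every row; the row index breaks ties
    # so that the rows themselves are never compared.
    keyed = [(rng.random(), index, row) for index, row in enumerate(rows)]
    # Order by the random key (the "arrange" step).
    keyed.sort(key=lambda item: (item[0], item[1]))
    # Keep the first n rows of the shuffled order (the "top n" step).
    return [row for _, _, row in keyed[:n]]


if __name__ == "__main__":
    data = list(range(100))
    print(slice_sample_by_random_key(data, 5, seed=1))
```

   Unlike filtering on a threshold and then truncating, this cannot return 
fewer than `n` rows when at least `n` exist, but it does sort the whole input, 
which may be expensive for large datasets.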


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
