thisisnic commented on issue #38638: URL: https://github.com/apache/arrow/issues/38638#issuecomment-1851821458
Thanks for the extra information there @lgaborini! I've looked at this again, and I think it's an unfortunate quirk of the original implementation (i.e. a known issue): we had to implement it a little differently because the C++ random function doesn't work, e.g. https://github.com/apache/arrow/pull/14361#issue-1403214998.

I've tried changing the `min` parameter in the internal UDF. Setting it higher than the default gets fewer rows selected; setting it lower than the default gets the right number of rows selected, but with a lot of repetition.

There's [this line](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/r/R/dplyr-slice.R#L132) that just takes the first `n` rows of data, which is probably the source of the lack of randomness. I was wondering whether we could call `arrange()` to order by the random number and then take the top `n` rows, though I'm not sure whether that will actually work.
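To make the difference between the two ideas concrete, here is a rough sketch in plain Python rather than arrow code. It is only a loose model: the function names, the `oversample` knob, and the threshold formula are hypothetical and are not taken from `dplyr-slice.R`. The first function filters on a random draw and then keeps the first `n` survivors; the second attaches a random key to every row, sorts by it, and takes the top `n`, which is the `arrange()` idea.

```python
import random


def sample_filter_then_head(rows, n, rng, oversample=1.0):
    """Keep rows whose random draw falls below a threshold, then take the first n.

    Hypothetical model only: `oversample` and the threshold formula are
    assumptions made for illustration.
    """
    threshold = min(1.0, oversample * n / len(rows))
    kept = [row for row in rows if rng.random() < threshold]
    # Original order is preserved, so truncating favours rows near the start.
    return kept[:n]


def sample_sort_by_random_key(rows, n, rng):
    """Attach a random key to every row, sort by that key, take the top n."""
    keyed = [(rng.random(), index, row) for index, row in enumerate(rows)]
    keyed.sort()  # the index breaks ties, so rows themselves are never compared
    return [row for _, _, row in keyed[:n]]


if __name__ == "__main__":
    rows = list(range(100))
    print(sample_filter_then_head(rows, 10, random.Random(1), oversample=5.0))
    print(sample_sort_by_random_key(rows, 10, random.Random(1)))
```

In this model, the filter-then-head version can return fewer than `n` rows when the threshold is tight, and it leans toward the start of the data when the threshold is generous. The sort-by-key version always returns exactly `n` distinct rows, at the cost of sorting the whole input. Whether the equivalent `arrange()` call works in the arrow query engine is still the open question.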
