thisisnic commented on issue #38638:
URL: https://github.com/apache/arrow/issues/38638#issuecomment-1851821458

   Thanks for the extra information there @lgaborini!
   
   I've looked at this again, and I think it's an unfortunate quirk of the 
original implementation (i.e. a known issue): we had to implement it a 
little differently because the C++ random function doesn't work; see e.g. 
https://github.com/apache/arrow/pull/14361#issue-1403214998.
   
   I've tried setting the `min` parameter in the internal UDF higher than 
the default (fewer rows are selected) and lower than the default (the right 
number of rows is selected, but with a lot of repetition).
   
   There's [this 
line](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/r/R/dplyr-slice.R#L132)
 that just takes the first `n` rows of data, which is probably the source of 
the lack of randomness.  I was wondering whether we could call `arrange()` to 
order by the random number and then take the top `n` rows, though I'm not sure 
whether that would actually work.
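
   To make the difference concrete, here is a minimal pure-Python sketch of 
the two strategies. It is not the arrow code: the function names, the 
oversampling factor of 2, and the toy data are all assumptions made for 
illustration. It compares "filter on a random number, then take the first `n` 
rows" against "order by a random number, then take the top `n` rows" by 
measuring how often rows from the second half of the data get picked.

```python
import random


def sample_filter_then_head(rows, n, rng, oversample=2.0):
    """Keep each row with probability p, then take the first n survivors.

    Hypothetical stand-in for a "filter, then head(n)" approach; the
    oversampling factor is an assumption, not a value from arrow.
    """
    p = min(1.0, oversample * n / len(rows))
    kept = [r for r in rows if rng.random() < p]
    return kept[:n]


def sample_sort_by_random_key(rows, n, rng):
    """Attach a random key to every row, order by it, take the top n."""
    keyed = [(rng.random(), r) for r in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [r for _, r in keyed[:n]]


def late_row_share(sampler, trials=2000, size=100, n=10, seed=0):
    """Fraction of sampled rows that come from the second half of the data."""
    rng = random.Random(seed)
    rows = list(range(size))
    late = 0
    total = 0
    for _ in range(trials):
        picked = sampler(rows, n, rng)
        total += len(picked)
        late += sum(1 for r in picked if r >= size // 2)
    return late / total


if __name__ == "__main__":
    # A uniform sample should draw about half its rows from the second half.
    print("filter then head:", late_row_share(sample_filter_then_head))
    print("sort by random key:", late_row_share(sample_sort_by_random_key))
```

   In this toy setting the first strategy favours rows near the start of the 
data, because `kept[:n]` discards any surviving rows beyond the first `n`, 
while the second treats every position alike. Whether the same ordering step 
can be expressed in the arrow query is the open question above.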

