blongworth commented on issue #38638: URL: https://github.com/apache/arrow/issues/38638#issuecomment-2523952609
I'm still seeing very non-random sampling with `slice_sample()` in Arrow 17.0.0. In a 400M row dataset spanning 2023-2024, a 10k row sample consistently does not contain timestamps later than Jan 2024. I'm guessing this is the known issue described above, but if a reprex would be helpful, I can put one together. As this issue could be dangerous for someone assuming a random sample, should there be a note in the docs or `slice_sample()` removed until it's fixed?
