brancz opened a new issue, #11554: URL: https://github.com/apache/datafusion/issues/11554
### Is your feature request related to a problem or challenge?

We have a large sample of statistical data. All we need is a subset of the data that maintains statistical significance. Returning that subset gives users a much smaller result, since insignificantly small values are left out, which results in much lower latency.

### Describe the solution you'd like

Add the ability to (statistically) sample rows. We've done this using reservoir sampling before. I imagine statistical sampling is a widely enough used function that it should be supported first-class.

### Describe alternatives you've considered

I don't know enough about DataFusion to know whether this is possible via a UDF.

In the past, we've had issues where records pushed into the query layer are sampled. However, the underlying record is still held onto, since immediately materializing it would result in tiny and inefficient 1-row records. Eventually, though, the sampled rows need to be materialized, as otherwise memory explodes.

### Additional context

_No response_
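
For reference, here is a minimal sketch of the reservoir sampling mentioned above (the classic "Algorithm R"), written in plain Python rather than against any DataFusion API. The function name `reservoir_sample` and its `k` and `seed` parameters are illustrative assumptions, not an existing or proposed interface.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream of unknown length.

    Hypothetical helper for illustration only; it is not part of DataFusion.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot uniformly from [0, i]; the row is kept only if the
            # slot falls inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    # Sample 10 rows from a stream of 1,000,000 without buffering the stream.
    print(reservoir_sample(range(1_000_000), k=10, seed=42))
```

The sketch makes a single pass and holds at most `k` rows at any time, regardless of how long the input is.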