[I] Add reservoir sampling [datafusion]

via GitHub Fri, 19 Jul 2024 11:32:36 -0700


brancz opened a new issue, #11554:
URL: https://github.com/apache/datafusion/issues/11554


   ### Is your feature request related to a problem or challenge?
   
   We have a large sample of statistical data. All we need is a subset of the 
data that remains statistically significant. Because insignificantly small 
values are left out, the result returned to users is much smaller, which in 
turn means much lower latency.
   
   ### Describe the solution you'd like
   
   Add the ability to (statistically) sample rows. We've done this using 
reservoir sampling before. I imagine statistical sampling is used widely 
enough that it should have first-class support.
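
For reference, a minimal sketch of classic reservoir sampling (Algorithm R) over a stream of rows. This is only an illustration of the technique, not DataFusion code: `reservoir_sample` and `XorShift64` are hypothetical names, and the tiny xorshift generator is there purely so the example has no dependencies (a real implementation would more likely use the `rand` crate).

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Assumption for illustration only; the seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns a value in `0..bound`. The modulo introduces a slight bias,
    /// which is acceptable for a sketch but not for production sampling.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Algorithm R: keep a uniform random sample of at most `k` items from a
/// stream of unknown length, using O(k) memory.
fn reservoir_sample<T, I>(rows: I, k: usize, rng: &mut XorShift64) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    if k == 0 {
        return reservoir;
    }
    for (i, row) in rows.into_iter().enumerate() {
        if i < k {
            // The first k rows fill the reservoir directly.
            reservoir.push(row);
        } else {
            // Row i (0-based) replaces a random slot with probability k / (i + 1).
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = row;
            }
        }
    }
    reservoir
}
```

Calling `reservoir_sample(0..10_000u32, 100, &mut XorShift64(0x9E3779B97F4A7C15))` returns 100 of the input values. If the input has fewer than `k` rows, all of them are returned in their original order.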
   
   ### Describe alternatives you've considered
   
   I don't know enough about DataFusion to know whether this is possible via a 
UDF. In the past, we've had issues when sampling records that are pushed into 
the query layer: each sampled row keeps holding onto its underlying record, 
because materializing it immediately would produce tiny, inefficient 1-row 
records. Eventually, though, the sampled rows need to be materialized, as 
otherwise memory usage explodes.
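
To make the trade-off concrete, here is a rough sketch of deferred materialization. It assumes a simplified single-column `Batch` type standing in for a real record batch; `Batch`, `RowRef`, and `materialize` are hypothetical names, not DataFusion or Arrow APIs. Sampled rows are held as cheap references, which keep the whole source batch alive, and are later copied into one dense batch in a single step.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a columnar record batch with one i64 column.
struct Batch {
    values: Vec<i64>,
}

/// A sampled row that only points into its source batch. Holding this keeps
/// the entire source batch alive, even if just one row of it was sampled.
struct RowRef {
    batch: Rc<Batch>,
    row: usize,
}

/// Copy all sampled rows into one dense batch. `sample` is consumed, so the
/// references to the source batches are dropped when this function returns.
fn materialize(sample: Vec<RowRef>) -> Batch {
    let values: Vec<i64> = sample.iter().map(|r| r.batch.values[r.row]).collect();
    Batch { values }
}
```

Materializing once per sample (or once the retained references exceed some budget) avoids both extremes described above: no 1-row batches, and no unbounded retention of source records.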
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


