[ 
https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435663#comment-17435663
 ] 

Weston Pace commented on ARROW-14254:
-------------------------------------

> Approach 1 sounds reasonable, though I don't understand why sorting is 
> required. Just use a streaming top-k.

Ah, good point.  I had forgotten that top-k implicitly sorts (I was thinking 
of it more as a head).
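
For reference, a minimal sketch of a streaming top-k used for sampling.  It 
assumes approach 1 means tagging each row with a random key and keeping the k 
rows with the largest keys; that reading is an assumption, and the function 
name and the list-of-lists batches are hypothetical stand-ins, not Arrow APIs.

```python
import heapq
import random


def streaming_top_k_sample(batches, k, rng=None):
    """Keep the k rows with the largest random keys across a stream of batches.

    Illustrative only: `batches` is any iterable of row sequences, standing in
    for record batches. Memory use is O(k) regardless of the input size.
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, seq, row); the smallest key sits at heap[0]
    seq = 0    # tie-breaker so the rows themselves are never compared
    for batch in batches:
        for row in batch:
            entry = (rng.random(), seq, row)
            seq += 1
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # The new key beats the current smallest key, so evict it.
                heapq.heapreplace(heap, entry)
    return [row for _, _, row in heap]
```

No full sort is needed: the heap only ever holds k entries, and the result is 
a uniform sample without replacement because every row gets an independent key.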

> 1 and 2 would satisfy the behavior that dplyr::slice_sample() supports. So if 
> we had random number generating kernels, we could do them.

This would not give exactly the same behavior as dplyr::slice_sample.  For 
example:
 * slice_sample supports sampling with replacement; neither 1 nor 2 would 
support that.  I think you would need something more like the approach Antoine 
originally suggested using take, but it would have to be some kind of 
streaming take.
 * There would be no support for weight_by.

For both the percentage and the exact-count variants, the total row count 
would be needed ahead of time.  Whether the user supplies a percentage or an 
exact count, the frontend will need to do two passes: the first to get the 
count and the second to get the sample.  Getting the count should be a 
metadata-only scan for reasonable (non-CSV) formats.
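
A rough sketch of the two-pass idea combined with a streaming take, to show how 
sampling with replacement could work once the count is known.  Everything here 
is hypothetical: `make_batches` is an assumed callable that restarts the scan, 
and the count is computed by iterating rather than from file metadata.

```python
import random


def count_rows(batches):
    """Pass 1: total row count (a real scan could read this from metadata)."""
    return sum(len(batch) for batch in batches)


def sample_with_replacement(make_batches, k=None, fraction=None, rng=None):
    """Two-pass sample with replacement over re-scannable batches.

    Exactly one of `k` (exact count) or `fraction` (percentage as 0..1)
    should be given; a fraction is converted to a count after pass 1.
    """
    rng = rng or random.Random()
    total = count_rows(make_batches())
    if k is None:
        k = round(fraction * total)
    if total == 0 or k <= 0:
        return []
    # Draw row indices up front; duplicates are allowed (with replacement).
    # Sorting lets pass 2 serve them in a single forward scan.
    indices = sorted(rng.randrange(total) for _ in range(k))
    out = []
    pos = 0     # next position in `indices` to serve
    offset = 0  # global index of the first row in the current batch
    for batch in make_batches():
        end = offset + len(batch)
        while pos < len(indices) and indices[pos] < end:
            out.append(batch[indices[pos] - offset])
            pos += 1
        offset = end
    return out
```

Note that the rows come back in scan order, so a caller wanting a random order 
would still need to shuffle the k results afterwards.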

I'll add issue links to the dependencies.

> [C++] Return a random sample of rows from a query
> -------------------------------------------------
>
>                 Key: ARROW-14254
>                 URL: https://issues.apache.org/jira/browse/ARROW-14254
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>              Labels: kernel, query-engine
>             Fix For: 7.0.0
>
>
> Please can we have a kernel that returns a random sample of rows? We've had a 
> request to be able to do this in R: 
> https://github.com/apache/arrow-cookbook/issues/83



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
