[jira] [Commented] (ARROW-14254) [C++] Return a random sample of rows from a query

Jonathan Keane (Jira) Thu, 28 Oct 2021 14:44:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435671#comment-17435671
 ]


Jonathan Keane commented on ARROW-14254:
----------------------------------------

For what it's worth: when I've seen this kind of operation in the wild, people 
almost always want a random N rows rather than a random percentage. Typically 
they are downsampling to a size that is reasonable for ML, summary stats, or 
exploratory analysis, and will take 10k, 100k, or 1M rows, or whatever they 
think is reasonable.
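
To make the "random N rows" case concrete, here is a minimal sketch in plain 
Python using reservoir sampling (Algorithm R), which draws a fixed-size uniform 
sample in one pass without knowing the row count up front. This is only an 
illustration, not how an Arrow kernel would or should do it; the function name 
`reservoir_sample` and its parameters are made up for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Return up to n rows chosen uniformly at random from a stream (Algorithm R).

    Hypothetical helper for illustration only; not an Arrow API.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = row
    return reservoir


# Downsample 200k "rows" (plain integers standing in for rows) to 10k.
sample = reservoir_sample(range(200_000), 10_000, seed=42)
```

If the input has fewer than N rows, the sketch simply returns all of them.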

This is (probably) a separate issue, but there is one concern with taking a 
limited number of rows: if we always take them from the beginning and the data 
arrives in some order (even if the order is not exactly the same every time, 
for example because it closely follows how the data is stored), the 
_randomness_ of the sample won't be good enough for what some people use it 
for. We might consider offering a fast (semi)random sample that takes rows 
this way, and also a more truly random sample that has stronger randomness 
guarantees.
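
As a rough illustration of that trade-off (one possible reading of it, not a 
proposal for the actual design), the sketch below compares taking rows from the 
start of ordered data with a cheap "semi-random" variant that visits batches in 
random order. Both helper names are hypothetical, and the batches are plain 
Python lists standing in for record batches.

```python
import random
from typing import List, Optional, Sequence


def head_sample(batches: Sequence[Sequence[int]], n: int) -> List[int]:
    """Take the first n rows in storage order (fast, but not random at all)."""
    out: List[int] = []
    for batch in batches:
        for row in batch:
            if len(out) == n:
                return out
            out.append(row)
    return out


def batch_shuffled_sample(
    batches: Sequence[Sequence[int]], n: int, seed: Optional[int] = None
) -> List[int]:
    """Visit batches in random order and take shuffled rows until n are collected.

    Only "semi" random: rows still cluster in whichever batches are visited first.
    """
    rng = random.Random(seed)
    order = list(range(len(batches)))
    rng.shuffle(order)
    out: List[int] = []
    for idx in order:
        batch = list(batches[idx])
        rng.shuffle(batch)
        take = min(n - len(out), len(batch))
        out.extend(batch[:take])
        if len(out) == n:
            break
    return out


# 10 batches of 100 rows each, stored in sorted order.
batches = [list(range(b * 100, (b + 1) * 100)) for b in range(10)]
head = head_sample(batches, 50)                     # always rows 0..49
semi = batch_shuffled_sample(batches, 50, seed=0)   # 50 rows from one random batch
```

The second variant avoids always returning the earliest rows, but every row 
still comes from a single batch here, so a sample with real uniformity 
guarantees would have to consider all rows.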

This is (almost definitely) a separate issue (or it might work automagically 
by combining this work with group_by), but another common task here is taking 
random samples from grouped sets of rows, e.g. "I want a random sample of 100 
rows from each day from 1 year ago to today, resulting in 36,500 rows".
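
A grouped sample like that can be sketched by keeping one small reservoir per 
group key. Again this is plain Python for illustration, with the hypothetical 
name `sample_per_group` and (key, value) tuples standing in for rows; it says 
nothing about how group_by would be wired up in Arrow.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Tuple


def sample_per_group(
    rows: Iterable[Tuple[Hashable, object]],
    n_per_group: int,
    seed: Optional[int] = None,
) -> Dict[Hashable, List[object]]:
    """Keep a uniform random sample of up to n_per_group values for each key.

    Hypothetical helper: runs one reservoir (Algorithm R) per group in one pass.
    """
    rng = random.Random(seed)
    reservoirs: Dict[Hashable, List[object]] = defaultdict(list)
    seen: Dict[Hashable, int] = defaultdict(int)
    for key, value in rows:
        i = seen[key]          # number of rows already seen for this group
        seen[key] += 1
        bucket = reservoirs[key]
        if i < n_per_group:
            bucket.append(value)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n_per_group:
                bucket[j] = value
    return dict(reservoirs)


# 365 "days" with 500 rows each; keep 100 random rows per day.
rows = ((day, (day, r)) for day in range(365) for r in range(500))
by_day = sample_per_group(rows, 100, seed=7)
total = sum(len(v) for v in by_day.values())  # 365 * 100 = 36500
```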

> [C++] Return a random sample of rows from a query
> -------------------------------------------------
>
>                 Key: ARROW-14254
>                 URL: https://issues.apache.org/jira/browse/ARROW-14254
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>              Labels: kernel, query-engine
>             Fix For: 7.0.0
>
>
> Please can we have a kernel that returns a random sample of rows? We've had a 
> request to be able to do this in R: 
> https://github.com/apache/arrow-cookbook/issues/83



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
