[
https://issues.apache.org/jira/browse/ARROW-14254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435671#comment-17435671
]
Jonathan Keane commented on ARROW-14254:
----------------------------------------
For what it's worth: when I've seen this kind of operation in the wild, people
almost always want a random N rows rather than a random percentage. Typically
they are downsampling to something reasonable to do [ML|summary stats|exploratory
analysis] with, and folks will take 100k or 10k or 1M or whatever they think is
reasonable.
This is (probably) a separate issue, but there is one concern with taking a
limited number of rows: if we always take them from the beginning and the data
arrives in some order (even if that order is not exactly the same every time,
but is close to how the data is stored, for example), the _randomness_ of the
sample won't be good enough for what some people use it for. We might consider
offering a fast (semi)random sample that does this, plus a more truly random
sample that has stronger randomness guarantees.
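
To make the fast-vs-truly-random distinction concrete, here is a minimal
sketch in plain Python (not Arrow code, and not a proposal for the kernel's
API). The names head_sample and reservoir_sample are hypothetical. The second
function uses reservoir sampling (Algorithm R), which is one well-known way to
get a uniform sample of N rows in a single pass without knowing the row count
up front; whether it suits the query engine is an open question.

```python
import random
from itertools import islice
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def head_sample(rows: Iterable[T], n: int) -> List[T]:
    """Fast but order-dependent: simply the first n rows of the stream."""
    return list(islice(rows, n))


def reservoir_sample(
    rows: Iterable[T], n: int, seed: Optional[int] = None
) -> List[T]:
    """Uniform random sample of up to n rows from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability n / (i + 1) by picking a
            # slot in [0, i]; only slots below n fall inside the reservoir.
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir
```

If the input is sorted (say, by date), head_sample returns only the earliest
rows, while reservoir_sample gives every row the same chance of being picked,
at the cost of scanning the whole input.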
This is (almost definitely) a separate issue (or possibly it would automagically
work with this work + group_by), but another common task here is taking random
samples from some grouped set of rows, e.g. "I want to have a random sample of
100 rows from each day from 1 year ago to today, resulting in 36,500 rows".
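
As a rough illustration of the grouped case, the sketch below keeps one
reservoir per group key in plain Python. It is only an assumption about how
this could behave, not how Arrow or group_by would implement it;
sample_per_group and its key parameter are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_per_group(
    rows: Iterable[T],
    key: Callable[[T], Hashable],
    n: int,
    seed: Optional[int] = None,
) -> Dict[Hashable, List[T]]:
    """Uniform random sample of up to n rows from each group, in one pass."""
    rng = random.Random(seed)
    seen: Dict[Hashable, int] = defaultdict(int)          # rows seen per group
    reservoirs: Dict[Hashable, List[T]] = defaultdict(list)
    for row in rows:
        group = key(row)
        i = seen[group]
        seen[group] += 1
        if i < n:
            # The first n rows of each group fill that group's reservoir.
            reservoirs[group].append(row)
        else:
            # Afterwards, replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoirs[group][j] = row
    return dict(reservoirs)


# Example: rows are (day, value) pairs and the group key is the day.
rows = [(day, value) for day in range(365) for value in range(1000)]
per_day = sample_per_group(rows, key=lambda r: r[0], n=100, seed=0)
```

Each group ends up with at most n rows, so the total is bounded by n times the
number of groups.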
> [C++] Return a random sample of rows from a query
> -------------------------------------------------
>
> Key: ARROW-14254
> URL: https://issues.apache.org/jira/browse/ARROW-14254
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
> Labels: kernel, query-engine
> Fix For: 7.0.0
>
>
> Please can we have a kernel that returns a random sample of rows? We've had a
> request to be able to do this in R:
> https://github.com/apache/arrow-cookbook/issues/83