Davies Liu created SPARK-4417:
---------------------------------
Summary: New API: sample RDD to fixed number of items
Key: SPARK-4417
URL: https://issues.apache.org/jira/browse/SPARK-4417
Project: Spark
Issue Type: New Feature
Components: PySpark, Spark Core
Reporter: Davies Liu
Sometimes we just want a fixed number of items randomly selected from an
RDD. For example, before sorting an RDD we need to gather a fixed number of
keys from each partition.
To do this today, we need two passes over the RDD: first get the total number
of items, then calculate the right ratio for sampling. In fact, we could do
this in one pass.
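
The issue does not say which one-pass technique is intended. One well-known
option is reservoir sampling (Algorithm R), sketched below in plain Python
with no Spark dependency. The function name `reservoir_sample` and its
parameters are hypothetical and are not part of any existing Spark API.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration only. It reads the input once and
    never needs to know the total number of items in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing one
            # only when the slot falls inside the reservoir, which happens
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# A single pass over an iterator of unknown length.
sample = reservoir_sample(iter(range(1000)), 5, seed=42)
```

For the "fixed number of keys from each partition" case, a function like this
could probably be applied to each partition's iterator, for example with
`rdd.mapPartitions(lambda it: reservoir_sample(it, k))`. Drawing one uniform
sample across the whole RDD would additionally require merging the
per-partition reservoirs with weights based on partition sizes, which this
sketch does not cover.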
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)