Davies Liu created SPARK-4417:
---------------------------------

             Summary: New API: sample RDD to fixed number of items
                 Key: SPARK-4417
                 URL: https://issues.apache.org/jira/browse/SPARK-4417
             Project: Spark
          Issue Type: New Feature
          Components: PySpark, Spark Core
            Reporter: Davies Liu


Sometimes we just want a fixed number of items randomly selected from an 
RDD. For example, before sorting an RDD we need to gather a fixed number of 
keys from each partition.

Currently, doing this takes two passes over the RDD: one to get the total 
count, and another to sample with the ratio calculated from that count. In 
fact, we could do this in one pass.
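
One way a single pass could work is reservoir sampling (Algorithm R). The 
sketch below is only an illustration in plain Python, not the proposed API: 
the function name reservoir_sample and its parameters are assumptions made 
up for this example. It keeps at most k items from an iterator of unknown 
length, and also returns the number of items seen.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Uniformly sample up to k items from an iterable in a single pass.

    Hypothetical helper for illustration; not part of Spark.
    Returns (sample, count), where count is the number of items seen.
    """
    rng = rng or random.Random()
    reservoir = []
    count = 0
    for count, item in enumerate(items, start=1):
        if count <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count by
            # overwriting a uniformly chosen slot.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


if __name__ == "__main__":
    sample, seen = reservoir_sample(range(100000), 20, random.Random(0))
    print(len(sample), seen)
```

On an RDD, a function like this could presumably be applied per partition 
through the existing mapPartitions method, which would give a fixed number 
of keys from each partition without counting first.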



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
