[ https://issues.apache.org/jira/browse/SPARK-31140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-31140.
----------------------------------
Resolution: Won't Fix
> Support Quick sample in RDD
> ---------------------------
>
> Key: SPARK-31140
> URL: https://issues.apache.org/jira/browse/SPARK-31140
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: deshanxiao
> Priority: Minor
>
> RDD.sample uses a *filter*-style pass to pick the records it needs, so every
> record in every partition is still read. If the raw data is very large, we
> spend too much time reading it. We could filter the raw partitions first to
> speed up sampling.
> {code:java}
> override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
>   val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
>   // Clone the sampler so this partition gets its own seeded instance.
>   val thisSampler = sampler.clone
>   thisSampler.setSeed(split.seed)
>   // The sampler walks the parent partition's iterator.
>   thisSampler.sample(firstParent[T].iterator(split.prev, context))
> }
> {code}
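
To make the proposal concrete, here is a minimal sketch in plain Scala, with no Spark dependency, contrasting the two strategies. Everything in it is hypothetical and for illustration only: `Partitions` stands in for an RDD, `ReadCounter` counts how many records are touched, and `filterSample` / `quickSample` are made-up names, not Spark APIs. It assumes that "filter the raw partition" means choosing a subset of partitions before sampling records inside them.

```scala
import scala.util.Random

object QuickSampleSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type Partitions[T] = Seq[Seq[T]]

  // Counts how many records were actually read by a sampling strategy.
  final class ReadCounter { var reads: Long = 0L }

  // Filter-style sampling: every record of every partition is read,
  // and each one is kept with probability `fraction`.
  def filterSample[T](data: Partitions[T], fraction: Double, seed: Long,
                      counter: ReadCounter): Seq[T] =
    data.zipWithIndex.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx) // one seeded generator per partition
      part.filter { _ =>
        counter.reads += 1
        rng.nextDouble() < fraction
      }
    }

  // Partition-first sampling (assumed reading of the proposal): keep each
  // partition with probability `partitionFraction` without touching its
  // records, then sample records only inside the partitions that were kept.
  def quickSample[T](data: Partitions[T], partitionFraction: Double,
                     recordFraction: Double, seed: Long,
                     counter: ReadCounter): Seq[T] = {
    val partitionRng = new Random(seed)
    val chosen = data.zipWithIndex.filter(_ => partitionRng.nextDouble() < partitionFraction)
    chosen.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx + 1)
      part.filter { _ =>
        counter.reads += 1
        rng.nextDouble() < recordFraction
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // 10 partitions of 1000 records each.
    val data: Partitions[Int] = (0 until 10).map(p => (0 until 1000).map(_ + p * 1000))
    val c1 = new ReadCounter
    val c2 = new ReadCounter
    val s1 = filterSample(data, 0.01, 42L, c1)
    val s2 = quickSample(data, 0.1, 0.1, 42L, c2)
    println(s"filter-style:    read ${c1.reads} records, kept ${s1.size}")
    println(s"partition-first: read ${c2.reads} records, kept ${s2.size}")
  }
}
```

In this sketch the expected overall fraction is `partitionFraction * recordFraction`, and only the chosen partitions are read. The trade-off is that records in the same partition are no longer included independently of each other, so the result can be skewed when partitions are not homogeneous.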