[ https://issues.apache.org/jira/browse/SPARK-4500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-4500. --------------------------------- Resolution: Incomplete > Improve exact stratified sampling implementation > ------------------------------------------------ > > Key: SPARK-4500 > URL: https://issues.apache.org/jira/browse/SPARK-4500 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.2.0 > Reporter: Joseph K. Bradley > Priority: Major > Labels: bulk-closed > > The current implementation for exact stratified sampling (sampleByKeyExact) > could be more efficient. Proposed algorithm sketch: > * Sampling is done separately for each stratum. Here, all counts are w.r.t. > a fixed stratum. > * Let n_partition = number of elements in a given partition > * This method uses 2-tiered sampling: > ** Tier 1 (on driver): Select the sample sizes {n_partition} for each > partition. > *** Without replacement, this uses a multivariate hypergeometric distribution. > *** With replacement, this should use a multinomial distribution. > ** Tier 2 (in parallel): Select a sample of size n_partition. > If anyone is interested in implementing this, I have a rough draft which > works without replacement, but it needs to be cleaned up and augmented to do > sampling with replacement too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org