GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/578

    Adding assignRanks and assignUniqueIds to RDD

    Assigning ranks to an ordered or unordered data set is a common
    operation. This can be done by first counting the records in each
    partition and then assigning ranks in parallel.

    The purpose of assigning ranks to an unordered set is usually to get
    a unique id for each item, e.g., to map feature names to feature
    indices. In such cases, the assignment can be done without counting
    records, saving one Spark job.

    https://spark-project.atlassian.net/browse/SPARK-1076

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-spark rank

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/578.patch

----
commit 21b434b77f1a7ffd75ba2d1ad4ab2296f1914971
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T23:18:41Z

    add assignRanks and assignUniqueIds to RDD

commit 630868c88f14ea955991acfd3d68caa8be6dedec
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-02-10T23:20:21Z

    newline

----
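
The two strategies described in the pull request can be illustrated without Spark. The following is a minimal Python sketch, not the code from the pull request: partitions are modeled as plain lists, the function names assign_ranks and assign_unique_ids are hypothetical stand-ins, and the interleaved id scheme (j * n + i) is an assumption about one way to skip the counting pass.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def assign_ranks(partitions: Sequence[Sequence[T]]) -> List[List[Tuple[T, int]]]:
    """Two passes: count records per partition, then number them consecutively."""
    # Pass 1 (the extra job): the size of every partition.
    counts = [len(p) for p in partitions]
    # Starting rank of each partition = total size of all partitions before it.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each partition can be processed independently of the others.
    return [
        [(item, offsets[i] + j) for j, item in enumerate(p)]
        for i, p in enumerate(partitions)
    ]


def assign_unique_ids(partitions: Sequence[Sequence[T]]) -> List[List[Tuple[T, int]]]:
    """One pass: the j-th item of partition i gets id j * n + i (assumed scheme)."""
    n = len(partitions)
    # No counts are needed: ids from different partitions differ modulo n.
    return [
        [(item, j * n + i) for j, item in enumerate(p)]
        for i, p in enumerate(partitions)
    ]


if __name__ == "__main__":
    parts = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(assign_ranks(parts))
    print(assign_unique_ids(parts))
```

In this sketch the ranks are consecutive and follow partition order, while the unique ids are distinct but may leave gaps when partitions differ in size. That is the trade-off the description points at: gaps are acceptable when the ids only need to be unique, and the counting pass can be dropped.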