I only opened it today, but it should help you: https://github.com/apache/spark/pull/6213
On Sat, May 16, 2015 at 6:18 PM, Chunnan Yao <yaochun...@gmail.com> wrote:
> Hi all,
> Recently I've run into a scenario where I need to conduct two-sample tests
> between all paired combinations of columns of an RDD. But the network load
> and the generation of the pair-wise computations are too time-consuming,
> and that has puzzled me for a long time. I want to run the Wilcoxon
> rank-sum test
> (http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) here, and get
> the top k most similar pairs.
>
> To be more concrete, I want:
> input: original = RDD[Array[Double](3000)]
> output: a matrix M of size 3000x3000, where M{i}{j} equals the result of a
> certain statistical test between RDD columns, that is, between
> original.map(_(i)) and original.map(_(j))
>
> I've read the source code of the Pearson and Spearman correlations in
> MLlib Statistics, as well as the implementation of the DIMSUM algorithm in
> RowMatrix.scala, because they all conduct pair-wise computation between
> columns in a parallel way. However, it seems those tests are applicable in
> Spark only because they exploit nothing but column-summary information
> (e.g. the sum of all elements in an RDD[Double]) plus information within
> the same array. To be explicit, they all follow a pattern similar to this:
> input: original = RDD[Array[Double](3000)]
> step1: summary = original.aggregate
> step2: summary_br = sc.broadcast(summary)
> step3: result = original.map{i => val summary_v = summary_br.value; some
> computation on i}.aggregate
> output: result: a matrix of 3000x3000
>
> They do not require information exchange between different records of the
> RDD. The Wilcoxon test, however, requires co-ranking within each pair, so
> it seems I have to generate the pair-wise computations one by one over the
> RDD columns. That would launch at least (n^2-n)/2 jobs, which is nearly
> 5,000,000 when n=3000. This is not acceptable.
>
> Does anyone have better ideas? This is really torturing me because I have
> a related project on hand!
>
>
> -----
> Feel the sparking Spark!
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-can-I-do-pair-wise-computation-between-RDD-feature-columns-tp12287.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
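For what it's worth, the per-pair computation itself is cheap once the two columns sit in the same place; only the joint ranking couples them. Below is a minimal sketch in plain Python (not Spark code; `mann_whitney_u` and `all_pairs_u` are hypothetical helper names) of the U statistic via the rank-sum formula with average ranks for ties. This is the work each pair needs; one plausible Spark layout is to broadcast the (sorted) column samples once and evaluate all (n^2-n)/2 pairs inside a single job over column-index pairs, rather than launching one job per pair.

```python
from itertools import combinations

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x vs. sample y.

    Uses the rank-sum formula: U_x = R_x - n_x(n_x+1)/2, where R_x is
    the sum of the ranks of x's elements in the jointly ranked combined
    sample (ties get the average of the ranks they span).
    """
    combined = [(v, 0) for v in x] + [(v, 1) for v in y]
    combined.sort(key=lambda t: t[0])
    n = len(combined)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values [i, j] and give each the average rank.
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    r_x = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n_x = len(x)
    return r_x - n_x * (n_x + 1) / 2

def all_pairs_u(columns):
    """U statistic for every unordered pair of columns (list of lists)."""
    return {(i, j): mann_whitney_u(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)}

# U ranges from 0 (every x below every y) to n_x * n_y (the reverse).
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0
print(mann_whitney_u([4, 5, 6], [1, 2, 3]))  # 9.0
```

Since the U statistic only depends on the two columns' values, a driver-side `broadcast` of the column samples (3000 columns x the row count, if that fits in executor memory) would let a single `mapPartitions` or `cartesian` over index pairs compute the full 3000x3000 matrix in one pass.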