Hi all,

Recently I ran into a scenario where I need to run two-sample tests between all pairwise combinations of the columns of an RDD. However, the network load and the generation of the pairwise computations are too time-consuming, and this has puzzled me for a long time. I want to run the Wilcoxon rank-sum test (http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) here and get the top k most similar pairs.
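For reference, here is a minimal local sketch of the statistic I mean for a single pair of columns. It assumes both columns have already been collected into plain in-memory arrays, and the names RankSumSketch, averageRanks and rankSumU are hypothetical helpers of mine, not MLlib API. It only computes the U statistic (no p-value), but it shows the co-ranking step that makes the test awkward to distribute:

```scala
object RankSumSketch {
  // Assign 1-based ranks to the values; tied values share their average rank.
  def averageRanks(values: Array[Double]): Array[Double] = {
    // Indices of `values`, ordered by the value they point at.
    val order = values.indices.toArray.sortWith((a, b) => values(a) < values(b))
    val ranks = new Array[Double](values.length)
    var start = 0
    while (start < order.length) {
      var end = start
      // Extend the block while the next sorted value ties with the current one.
      while (end + 1 < order.length && values(order(end + 1)) == values(order(start))) end += 1
      val avg = (start + end) / 2.0 + 1.0 // mean of the 1-based ranks start+1 .. end+1
      var k = start
      while (k <= end) { ranks(order(k)) = avg; k += 1 }
      start = end + 1
    }
    ranks
  }

  // Mann-Whitney U statistic of sample x against sample y.
  def rankSumU(x: Array[Double], y: Array[Double]): Double = {
    val ranks = averageRanks(x ++ y)        // co-rank the pooled samples
    val rankSumX = ranks.take(x.length).sum // rank sum of the first sample
    rankSumX - x.length * (x.length + 1) / 2.0
  }

  def main(args: Array[String]): Unit = {
    val colI = Array(1.0, 2.0, 3.0)
    val colJ = Array(4.0, 5.0, 6.0)
    println(rankSumU(colI, colJ)) // prints 0.0: every value in colI is below every value in colJ
    println(rankSumU(colJ, colI)) // prints 9.0: the two U values sum to 3 * 3
  }
}
```

The point is that the ranks of column i depend on which column j it is pooled with, so they cannot be precomputed once per column and reused across pairs.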
To be more concrete, I want:

input: original = RDD[Array[Double](3000)]
output: a matrix M of size 3000x3000, where M{i}{j} equals the result of a certain statistical test between RDD columns i and j, that is, between original.map(_(i)) and original.map(_(j))

I've read the source code of Pearson's and Spearman's correlation in MLlib Statistics, as well as the implementation of the DIMSUM algorithm in RowMatrix.scala, because they all perform pairwise computation between columns in parallel. However, it seems that those methods work well in Spark because they only use column-summary information (such as the sum of all elements in an RDD[Double]) plus information within the same array. To be explicit, they all look similar to the following:

input: original = RDD[Array[Double](3000)]
step1: summary = original.aggregate
step2: summary_br = sc.broadcast(summary)
step3: result = original.map{i => val summary_v = summary_br.value; some computation on i}.aggregate
output: result: a matrix of 3000x3000

They do not require any exchange of information between different records in the RDD. The Wilcoxon test, however, requires co-ranking the values of each pair of columns. It seems I have to run the pairwise computations one by one on the RDD columns. That would launch at least (n^2-n)/2 jobs, which is about 4.5 million (4,498,500) when n=3000. That is not acceptable.

Does anyone have better ideas? This is really torturing me because I have a related project on hand!

-----
Feel the sparking Spark!
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-can-I-do-pair-wise-computation-between-RDD-feature-columns-tp12287.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.