How can I do pair-wise computation between RDD feature columns?

Chunnan Yao Sat, 16 May 2015 18:19:27 -0700

Hi all, 
Recently I've run into a scenario where I need to conduct two-sample tests
between all pairwise combinations of the columns of an RDD. But the network
load and the generation of the pair-wise computations are too time-consuming.
That has puzzled me for a long time. I want to conduct the Wilcoxon rank-sum
test (http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) here, and get
the top k most similar pairs.


To be more concrete, I want:
input: original = RDD[Array[Double](3000)] (each array has 3000 elements)
output: a matrix M of size 3000x3000, where M(i)(j) equals the result
of a certain statistical test between two RDD columns, that is, between
original.map(_(i)) and original.map(_(j))
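
To make the target concrete, here is a minimal sketch of what a single entry
M(i)(j) could hold for the rank-sum test, computed locally on two columns that
have already been collected into arrays. The object name RankSum and the
method uStatistic are hypothetical, and the choice of the U statistic (rather
than a p-value) as the matrix entry is an assumption, not something stated
above. Ties get mid-ranks.

```scala
object RankSum {
  /** Mann-Whitney U statistic of sample x against sample y (mid-ranks for ties). */
  def uStatistic(x: Array[Double], y: Array[Double]): Double = {
    // Pool both samples, remembering which sample each value came from.
    val pooled = (x.map(v => (v, true)) ++ y.map(v => (v, false)))
      .sortWith((a, b) => a._1 < b._1)
    var rankSumX = 0.0
    var i = 0
    while (i < pooled.length) {
      // Find the run [i, j) of values tied with pooled(i).
      var j = i
      while (j < pooled.length && pooled(j)._1 == pooled(i)._1) j += 1
      // The 1-based ranks i+1 .. j all receive the same mid-rank.
      val midRank = (i + 1 + j) / 2.0
      val xInRun = (i until j).count(k => pooled(k)._2)
      rankSumX += midRank * xInRun
      i = j
    }
    // U = (rank sum of x) - n_x * (n_x + 1) / 2
    rankSumX - x.length * (x.length + 1) / 2.0
  }
}
```

The joint sort of both columns is the co-ranking step: it needs every value of
both columns in one place, which is what makes this test awkward to express as
a per-record computation.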

I've read the source code of Pearson's and Spearman's correlation in MLlib
Statistics, as well as the implementation of the DIMSUM algorithm in
RowMatrix.scala, because they all conduct pair-wise computation between
columns in parallel. However, it seems that those tests are practical in
Spark because they only use column-summary info (e.g. the sum of all
elements in an RDD[Double]) and information within the same array. To be
explicit, they are all similar to the following:
input: original = RDD[Array[Double](3000)] 
step1: summary = original.aggregate 
step2: summary_br = sc.broadcast(summary) 
step3: result = original.map{i => val summary_v = summary_br.value;
       some computation on i}.aggregate
output: result: a 3000x3000 matrix
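
The three steps above can be sketched for Pearson's correlation as follows.
This is an illustration of the pattern only, not MLlib's actual code: a local
Seq stands in for the RDD, foldLeft stands in for aggregate, and the names
SummaryPattern and pearson are hypothetical.

```scala
object SummaryPattern {
  /** Pearson correlation matrix built from per-row contributions only. */
  def pearson(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length
    val d = rows.head.length
    // Step 1: aggregate a per-column summary (here just the column sums).
    val sums = rows.foldLeft(new Array[Double](d)) { (acc, r) =>
      for (k <- 0 until d) acc(k) += r(k)
      acc
    }
    // Step 2: on a cluster, this small summary would be broadcast to the workers.
    val means = sums.map(_ / n)
    // Step 3: each row adds its own d x d contribution; rows never need each other.
    val cov = rows.foldLeft(Array.ofDim[Double](d, d)) { (acc, r) =>
      for (i <- 0 until d; j <- 0 until d)
        acc(i)(j) += (r(i) - means(i)) * (r(j) - means(j))
      acc
    }
    // Normalise the accumulated cross-products into correlations.
    Array.tabulate(d, d)((i, j) => cov(i)(j) / math.sqrt(cov(i)(i) * cov(j)(j)))
  }
}
```

Every quantity in step 3 depends on one row plus the broadcast means, so the
work splits cleanly across partitions.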

They do not require information exchange between different records in the
RDD. However, the Wilcoxon test requires co-ranking between pairs of columns.
It seems I have to run the pair-wise computations one by one on RDD columns.
This would launch at least (n^2-n)/2 jobs, which is about 4.5 million
(4,498,500) when n=3000. That is not acceptable.
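
As a quick check of the job count, the number of unordered column pairs can
be computed directly. The object name PairCount is hypothetical.

```scala
object PairCount {
  // Number of unordered column pairs (i, j) with i < j among n columns.
  def pairs(n: Long): Long = n * (n - 1) / 2
}
```

For n = 3000, PairCount.pairs(3000L) gives 4498500.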

Does anyone have better ideas? This is really torturing me because I have a
related project on hand!



-----
Feel the sparking Spark!
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-can-I-do-pair-wise-computation-between-RDD-feature-columns-tp12287.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
