Hello everyone,

I am interested in running an application on Spark that at some point needs to compare every element of an RDD against every other element in order to build a distance matrix. The RDD is of type <String, Double>, and Pearson correlation is applied between each element and all the others, producing a matrix with the distance for every possible pair of elements.
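To make the comparison concrete: it is just the textbook Pearson correlation over two elements' value vectors. A simplified version of what I compute, assuming the values for each element have already been collected into a double[], looks roughly like this (names simplified for this mail):

    // Pearson correlation r between two equal-length, non-constant series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;  // assumes x.length == y.length and n > 0
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];         sy  += y[i];
            sxx += x[i] * x[i];  syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        // r = cov(x, y) / (sd(x) * sd(y)), in the single-pass form
        double cov = sxy - sx * sy / n;
        double sdx = Math.sqrt(sxx - sx * sx / n);
        double sdy = Math.sqrt(syy - sy * sy / n);
        return cov / (sdx * sdy);
    }

A distance can then be derived from r in the usual way (e.g. 1 - r).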
I have implemented this by taking the cartesian product of the RDD with itself, filtering away half of the matrix since it is symmetric, and then doing a combineByKey to gather, for each element, all the other elements it needs to be compared with. I then map the output through the comparison function implementing the Pearson correlation. As you can probably guess, this is dead slow. I am using Spark 1.6.2 and the code is written in Java 8.

At the current processing rate, on a cluster of 4 nodes with 16 cores and 56 GB of RAM each, and with a list of ~15,000 elements split into 512 partitions, the cartesian operation alone is estimated to take about 3,000 hours (all cores maxed out on all machines)!

Is there a way to avoid the cartesian product to calculate what I want? Would a DataFrame join be faster? Or is this an operation that simply requires a much larger cluster?

Thank you,
Paschalis
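P.S. In case a concrete picture helps, here is a stripped-down sketch of the pipeline described above. It is simplified in two ways: I pretend each element already carries its full value vector as a double[] (in my real code the values are only assembled via the combineByKey step), and PairwisePearson/correlate are made-up names for this mail. pearson() is the helper shown earlier, assumed to live in the same class.

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class PairwisePearson {

        // All pairs of named vectors -> ((name1, name2), r).
        // The filter keeps only the strict upper triangle (name1 < name2),
        // since the matrix is symmetric; this is the "filter half away" step.
        static JavaPairRDD<Tuple2<String, String>, Double> correlate(
                JavaPairRDD<String, double[]> elements) {
            return elements.cartesian(elements)
                .filter(p -> p._1()._1().compareTo(p._2()._1()) < 0)
                .mapToPair(p -> new Tuple2<>(
                    new Tuple2<>(p._1()._1(), p._2()._1()),
                    pearson(p._1()._2(), p._2()._2())));
        }

        // static double pearson(double[] x, double[] y) { ... as earlier ... }
    }

It is the elements.cartesian(elements) call here that dominates the runtime.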