Hello everyone,

I am working on a Spark application that at some point needs to compare
every element of an RDD against every other element to build a distance
matrix. The RDD is of type <String, Double>, and the Pearson correlation
is computed between each element and all the others, producing a matrix
with the distance for every possible pair of elements.
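
For concreteness, the comparison function is essentially the textbook
Pearson correlation, something along these lines (a simplified sketch;
the real per-key values are double[] series that the correlation runs
over):

    // Plain Pearson correlation between two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt(n * sumXX - sumX * sumX)
                   * Math.sqrt(n * sumYY - sumY * sumY);
        return num / den;
    }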

I have implemented this by taking the cartesian product of the RDD with
itself, filtering away half of the pairs (the matrix is symmetric), and
then doing a combineByKey to gather, for each element, all the other
elements it needs to be compared with. Finally, I map the comparison
function implementing the Pearson correlation over the output.
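
The relevant part of the code looks roughly like this (simplified: the
sketch maps the Pearson function straight over the filtered pairs and
leaves out the combineByKey regrouping):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;
    import scala.Tuple3;

    static JavaRDD<Tuple3<String, String, Double>> distanceMatrix(
            JavaPairRDD<String, double[]> data) {
        return data
            // Every element paired with every element: O(n^2) records.
            .cartesian(data)
            // Keep one triangle only, since the matrix is symmetric.
            .filter(p -> p._1()._1().compareTo(p._2()._1()) < 0)
            // One Pearson distance per surviving pair.
            .map(p -> new Tuple3<>(p._1()._1(), p._2()._1(),
                                   pearson(p._1()._2(), p._2()._2())));
    }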

As you can probably guess, this is dead slow. I am using Spark 1.6.2,
and the code is written in Java 8. At the rate it is going on a cluster
of 4 nodes with 16 cores and 56 GB of RAM each, for a list of ~15,000
elements split into 512 partitions, the cartesian operation alone is
estimated to take about 3,000 hours (with all cores maxed out on all
machines)!
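
If my arithmetic is right, the scale is brutal: ~15,000 elements mean
roughly 15,000^2 / 2 ≈ 112 million unordered pairs to score, and, if I
understand cartesian correctly, crossing two 512-partition RDDs produces
512^2 = 262,144 output partitions, with every input partition shipped to
~512 tasks, so the overhead is not just the comparisons themselves.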

Is there a way to avoid the cartesian product when calculating what I
want? Would a DataFrame join be faster? Or is this an operation that
simply requires a much larger cluster?

Thank you,

Paschalis
