A common reason for the "Joining ... is slow" message is that you're
joining VertexRDDs without having cached them first. This will cause Spark
to recompute unnecessarily, and as a side effect, the same index will get
created twice and GraphX won't be able to do an efficient zip join.

For example, the following code will counterintuitively produce the
"Joining ... is slow" message:

import org.apache.spark.graphx._  // for VertexRDD; assumes a spark-shell session where sc is defined

val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x))))
a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) }  // b is an Option, since leftJoin may find no match

The remedy is to call a.cache() before a.leftJoin(a).
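For reference, a minimal sketch of the cached version (same assumptions as above: a spark-shell session with sc and the graphx import):

val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x))))
a.cache()  // materialize a and its index once, so the self-join can reuse it
a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) }  // now joins via an efficient zip join

With the VertexRDD cached, both sides of the join share the same index, so GraphX avoids recomputing it and the warning goes away.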

Ankur <http://www.ankurdave.com/>