Github user khayyatzy commented on the pull request:
https://github.com/apache/incubator-spark/pull/587#issuecomment-34959013
I am using rdd.selfCartesian for optimization purposes. I am using Spark
for a large data analytics project on relational data. My application
sometimes needs to compare a table with itself to look for inconsistencies
within the data, regardless of the order of the compared tuples.
One advantage of rdd.selfCartesian is that it generates roughly half as many
results as rdd.cartesian(rdd). For example, for a table with 100 rows,
rdd.cartesian(rdd) generates 10000 tuples to compare, while
rdd.selfCartesian generates only 5050 (100 * 101 / 2: each unordered pair
once, plus each row paired with itself).
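The counts above can be checked with a small sketch in plain Python. It does
not call Spark: a list stands in for the RDD, itertools.product stands in for
rdd.cartesian(rdd), and itertools.combinations_with_replacement is assumed to
match the pairs rdd.selfCartesian would produce (each unordered pair once,
plus each row with itself).

```python
from itertools import combinations_with_replacement, product

# Stand-in for a table with 100 rows (assumption: row contents do not matter here).
rows = list(range(100))

# Every ordered pair, as a full self-join would produce: n * n.
full_pairs = list(product(rows, repeat=2))

# Each unordered pair once, including a row paired with itself: n * (n + 1) / 2.
half_pairs = list(combinations_with_replacement(rows, 2))

print(len(full_pairs), len(half_pairs))
```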
Another advantage is that rdd.selfCartesian lets me avoid duplicate errors
when searching for tuple inconsistencies. In my application, if an error is
found for tuples in the order (tx,ty), the same error is also found when they
are in the opposite order (ty,tx). If I used rdd.cartesian(rdd), I would have
to look for duplicate errors in the resulting pair RDD and remove them.
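The duplicate-error problem can be illustrated the same way. This is only a
sketch in plain Python, not Spark code: the rows, the column layout, and the
inconsistent() check are hypothetical, and the unordered-pair generator is an
assumed stand-in for rdd.selfCartesian.

```python
from itertools import combinations_with_replacement, product

# Hypothetical rows: (id, zip_code, city).
rows = [
    (1, "10001", "New York"),
    (2, "10001", "Newark"),
    (3, "60601", "Chicago"),
]

def inconsistent(tx, ty):
    """Hypothetical symmetric check: same zip code but different city."""
    return tx[1] == ty[1] and tx[2] != ty[2]

# Full self-join: the same violation shows up as (tx, ty) and as (ty, tx).
full_errors = [(tx[0], ty[0])
               for tx, ty in product(rows, repeat=2)
               if inconsistent(tx, ty)]

# Extra pass needed to collapse the two orderings into one.
deduped = {tuple(sorted(pair)) for pair in full_errors}

# Unordered pairs: each violation is reported once, no extra pass.
half_errors = [(tx[0], ty[0])
               for tx, ty in combinations_with_replacement(rows, 2)
               if inconsistent(tx, ty)]
```

With the full self-join, the single violation between rows 1 and 2 is
reported twice and has to be deduplicated afterwards; with unordered pairs it
is reported once.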
Regards,
Zuhair Khayyat