I am using rdd.selfCartesian for optimization purposes. I am using Spark for a large data analytics project on relational data. My application sometimes needs to compare a table with itself to look for inconsistencies within the data, regardless of the order of the compared tuples.
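
To make the idea of an order-insensitive self-comparison concrete, here is a minimal sketch in plain Python. It is not Spark code and not the implementation in the pull request; the helper names full_pairs and self_pairs are made up for illustration. It assumes that a row is also paired with itself, and it contrasts the full cross product with an enumeration that emits each unordered pair only once.

```python
from itertools import combinations_with_replacement, product


def full_pairs(rows):
    """Every ordered pair (x, y): the analogue of a full cross product."""
    return list(product(rows, repeat=2))


def self_pairs(rows):
    """Each unordered pair once, including (x, x).

    Hypothetical stand-in for the behaviour described in this thread;
    the real selfCartesian may differ in detail.
    """
    return list(combinations_with_replacement(rows, 2))


rows = list(range(100))  # stand-in for a table with 100 rows
full = full_pairs(rows)
half = self_pairs(rows)

# n * n ordered pairs versus n * (n + 1) / 2 unordered pairs
print(len(full), len(half))
```

With this enumeration, a pair such as (0, 1) appears once and its mirror (1, 0) never appears, so a check that is symmetric in its arguments runs once per pair.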

One advantage of rdd.selfCartesian is that it generates roughly half as many results as rdd.cartesian(rdd). For example, for a table with 100 rows, rdd.cartesian(rdd) generates 10,000 tuple pairs to compare, while rdd.selfCartesian generates only 5,050.

Another advantage is that rdd.selfCartesian helps me get rid of duplicate errors when searching for tuple inconsistencies. In my application, if an error can be found for tuples in the order (tx,ty), the same error can also be found in the opposite order (ty,tx). If I used rdd.cartesian(rdd), I would have to look for duplicate errors in the resulting RDD of pairs and remove them.

Regards,
Zuhair Khayyat

On Wed, Feb 12, 2014 at 11:52 PM, rxin <g...@git.apache.org> wrote:

> Github user rxin commented on the pull request:
>
>     https://github.com/apache/incubator-spark/pull/587#issuecomment-34916300
>
> Thanks for submitting this. Just curious, what is the advantage of
> this over rdd.cartesian(rdd), i.e. just use cartesian to join itself?