I am using rdd.selfCartesian for optimization purposes. I am using Spark
for a large data analytics project on relational data. My application
sometimes needs to compare a table with itself, looking for
inconsistencies within the data regardless of the order of the compared
tuples.

One advantage of rdd.selfCartesian is that it generates roughly half
the results of rdd.cartesian(rdd). For example, for a table with 100
rows, rdd.cartesian(rdd) will generate 10000 tuples to compare (100 *
100), while rdd.selfCartesian will only generate 5050 tuples (100 * 101
/ 2).
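
The two counts can be checked with a small sketch in plain Python
(not Spark). It assumes that selfCartesian emits each unordered pair
once and also pairs each row with itself, which is what the 5050 figure
suggests; the names below are illustrative only.

```python
from itertools import combinations_with_replacement, product

# Stand-in for a table with 100 rows (row ids only).
rows = list(range(100))

# Every ordered pair, analogous to rdd.cartesian(rdd).
all_ordered = list(product(rows, repeat=2))

# Each unordered pair once, including (x, x); the assumed
# behaviour of rdd.selfCartesian.
unordered = list(combinations_with_replacement(rows, 2))

print(len(all_ordered), len(unordered))
```

In general, n rows give n * n ordered pairs but only n * (n + 1) / 2
unordered ones.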

Another advantage is that rdd.selfCartesian lets me avoid duplicate
errors when searching for tuple inconsistencies. In my application, if
an error can be found for tuples in the order (tx,ty), the same error
can also be found if they are in the opposite order (ty,tx). If I used
rdd.cartesian(rdd), I would have to look for duplicate errors in the
resulting RDD of pairs and remove them.
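
A minimal sketch of the duplicate problem, again in plain Python rather
than Spark. The rows, the conflicts() check, and the zip/city rule are
all hypothetical; the only property that matters is that the check is
symmetric.

```python
from itertools import combinations_with_replacement, product

# Hypothetical rows: (row_id, zip_code, city).
rows = [
    (1, "10001", "New York"),
    (2, "10001", "Newark"),
    (3, "94105", "San Francisco"),
]

def conflicts(tx, ty):
    """Symmetric check: same zip code but different city."""
    return tx[1] == ty[1] and tx[2] != ty[2]

# Full cartesian product: the same error shows up in both orders.
full_errors = [(tx[0], ty[0])
               for tx, ty in product(rows, repeat=2)
               if conflicts(tx, ty)]

# Extra pass needed to drop the mirrored duplicates.
deduped = {tuple(sorted(pair)) for pair in full_errors}

# Unordered pairs only: each error is reported once, no extra pass.
half_errors = [(tx[0], ty[0])
               for tx, ty in combinations_with_replacement(rows, 2)
               if conflicts(tx, ty)]

print(full_errors, deduped, half_errors)
```

With the full product, the conflict between rows 1 and 2 is found
twice and has to be deduplicated; with unordered pairs it is found
once.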

Regards,
Zuhair Khayyat


On Wed, Feb 12, 2014 at 11:52 PM, rxin <g...@git.apache.org> wrote:

> Github user rxin commented on the pull request:
>
>
> https://github.com/apache/incubator-spark/pull/587#issuecomment-34916300
>
>     Thanks for submitting this. Just curious, what is the advantage of
> this over rdd.cartesian(rdd), i.e. just use cartesian to join itself?
>
>
