[GitHub] incubator-spark pull request: Adding RDD unique self cross product

khayyatzy Thu, 13 Feb 2014 00:55:10 -0800

Github user khayyatzy commented on the pull request:

    https://github.com/apache/incubator-spark/pull/587#issuecomment-34959013
  
    I am using rdd.selfCartesian for optimization purposes. I am using Spark 
for a large data analytics project on relational data. My application sometimes 
needs to compare a table with itself to look for inconsistencies within the 
data, regardless of the order of the compared tuples. 
    
    One advantage of rdd.selfCartesian is that it generates roughly half as 
many results as rdd.cartesian(rdd). For example, for a table with 100 rows, 
rdd.cartesian(rdd) generates 10000 tuples to compare, while 
rdd.selfCartesian generates only 5050 tuples (100 * 101 / 2).
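
The counts above can be checked with a small plain-Python sketch. This is not Spark code and not the implementation in the pull request: it only assumes that the unique self cross product keeps one tuple per unordered pair, including each row paired with itself, which is what the 5050 figure implies. The name `rows` is a hypothetical stand-in for the table.

```python
from itertools import combinations_with_replacement, product

# Hypothetical stand-in for a table with 100 rows.
rows = list(range(100))

# Full self cross product: every ordered pair (tx, ty).
full_pairs = list(product(rows, repeat=2))

# Unique self cross product (assumed semantics): one tuple per unordered
# pair, with each row also paired with itself.
unique_pairs = list(combinations_with_replacement(rows, 2))

print(len(full_pairs))    # 10000 = 100 * 100
print(len(unique_pairs))  # 5050  = 100 * 101 / 2
```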
    
    Another advantage is that rdd.selfCartesian helps me get rid of 
duplicate errors when searching for tuple inconsistencies. In my application, 
if an error can be found for tuples in the order (tx,ty), the same error can 
also be found when they are in the opposite order (ty,tx). If I used 
rdd.cartesian(rdd), I would have to look for duplicate errors in the resulting 
pair RDD and remove them.
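
The duplicate-error point can be illustrated the same way, again in plain Python rather than Spark. The table contents and the `inconsistent` check are hypothetical, chosen only so that the check is symmetric in its two arguments; the unordered-pair enumeration stands in for the assumed behaviour of rdd.selfCartesian.

```python
from itertools import combinations_with_replacement, product

# Hypothetical table of (id, city, zip_code); rows 1 and 3 share a zip code
# but disagree on the city.
table = [
    (1, "Jeddah", "21589"),
    (2, "Riyadh", "11564"),
    (3, "Makkah", "21589"),
]

def inconsistent(a, b):
    """Hypothetical symmetric check: same zip code but different city."""
    return a[2] == b[2] and a[1] != b[1]

# Full self cross product: the same violation shows up in both orders.
errors_full = [(a[0], b[0])
               for a, b in product(table, repeat=2) if inconsistent(a, b)]

# Extra pass needed to collapse (tx, ty) and (ty, tx) into one report.
errors_deduped = sorted({tuple(sorted(pair)) for pair in errors_full})

# Unordered pairs only: each violation is found once, with no extra pass.
errors_unique = [(a[0], b[0])
                 for a, b in combinations_with_replacement(table, 2)
                 if inconsistent(a, b)]
```

Here `errors_full` holds both orderings of the same violation, while `errors_unique` already matches `errors_deduped` without a separate deduplication step.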
    
    Regards,
    Zuhair Khayyat
