[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle

sam (JIRA) Wed, 09 Jul 2014 08:42:03 -0700

    [ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056355#comment-14056355 ]


sam commented on SPARK-546:
---------------------------

We use the pimp-my-library pattern to add this functionality, built on a
single cogroup. Here is our code:

case class OuterJoinableRDD[K: ClassManifest, V1: ClassManifest](rdd: RDD[(K, V1)])
    extends RDDWrapper[(K, V1)] {
  // Full outer join built on one cogroup, so it costs a single shuffle.
  def outerJoin[V2](other: RDD[(K, V2)], numPartitions: Int): RDD[(K, (Option[V1], Option[V2]))] =
    rdd.cogroup(other, new HashPartitioner(numPartitions)).flatMapValues {
      // Key present only on the left.
      case (v1s, Seq()) => v1s.iterator.map(v1 => (Some(v1), None))
      // Key present only on the right.
      case (Seq(), v2s) => v2s.iterator.map(v2 => (None, Some(v2)))
      // Key present on both sides: emit every (v1, v2) combination.
      case (v1s, v2s) => v1s.iterator.flatMap(v1 => v2s.iterator.map(v2 => (Some(v1), Some(v2))))
    }
}
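
For anyone who wants to check the per-key logic without a cluster, here is a
rough sketch of the same three cases on plain Scala collections. The names
OuterJoinSketch, expand and the local outerJoin are hypothetical and only for
illustration, and groupBy stands in for cogroup, so this is not Spark code and
says nothing about shuffle behaviour.

```scala
// Hypothetical, Spark-free sketch of the per-key expansion used above.
object OuterJoinSketch {

  // Turn the two value groups for one key into outer-join rows.
  def expand[V1, V2](v1s: Seq[V1], v2s: Seq[V2]): Seq[(Option[V1], Option[V2])] =
    if (v2s.isEmpty) v1s.map(v1 => (Some(v1), Option.empty[V2]))       // left only
    else if (v1s.isEmpty) v2s.map(v2 => (Option.empty[V1], Some(v2)))  // right only
    else for (v1 <- v1s; v2 <- v2s) yield (Some(v1), Some(v2))         // both sides

  // Local stand-in for cogroup + flatMapValues: group both inputs by key,
  // then expand each key's groups. Output order is not specified.
  def outerJoin[K, V1, V2](left: Seq[(K, V1)],
                           right: Seq[(K, V2)]): Seq[(K, (Option[V1], Option[V2]))] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).toSeq.flatMap { k =>
      val v1s = l.getOrElse(k, Seq.empty).map(_._2)
      val v2s = r.getOrElse(k, Seq.empty).map(_._2)
      expand(v1s, v2s).map(pair => (k, pair))
    }
  }
}
```

Calling OuterJoinSketch.outerJoin(Seq(1 -> "a", 2 -> "b"), Seq(2 -> 10, 3 -> 30))
should yield one row per key here: left-only for key 1, a matched pair for
key 2, and right-only for key 3.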

Hope it helps :) (Disclaimer: this code is still in testing.)

> Support full outer join and multiple join in a single shuffle
> -------------------------------------------------------------
>
>                 Key: SPARK-546
>                 URL: https://issues.apache.org/jira/browse/SPARK-546
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Reynold Xin
>
> RDD[(K,V)] now supports left/right outer join but not full outer join.
> Also it'd be nice to provide a way for users to join multiple RDDs on the 
> same key in a single shuffle.



