[
https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056355#comment-14056355
]
sam commented on SPARK-546:
---------------------------
We use the pimp-my-library pattern (a wrapper class around the pair RDD) to
add this functionality. Basically, here's our code:
case class OuterJoinableRDD[K: ClassManifest, V1: ClassManifest](rdd: RDD[(K, V1)])
    extends RDDWrapper[(K, V1)] {

  // Full outer join: cogroup both RDDs once, then expand each key's groups.
  def outerJoin[V2](other: RDD[(K, V2)], numPartitions: Int)
      : RDD[(K, (Option[V1], Option[V2]))] =
    rdd.cogroup(other, new HashPartitioner(numPartitions)).flatMapValues {
      // Key present only on the left side.
      case (v1s, Seq()) => v1s.iterator.map(v1 => (Some(v1), None))
      // Key present only on the right side.
      case (Seq(), v2s) => v2s.iterator.map(v2 => (None, Some(v2)))
      // Key present on both sides: emit every (left, right) combination.
      case (v1s, v2s) =>
        v1s.iterator.flatMap(v1 => v2s.iterator.map(v2 => (Some(v1), Some(v2))))
    }
}
Hope it helps :) (Disclaimer: this code is still in testing.)
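
For anyone who wants to check the expected semantics without a cluster, here is a minimal sketch of the same cogroup-then-expand idea on plain Scala collections. It is not Spark code and not part of the snippet above: the object FullOuterJoinSketch and its cogroup and fullOuterJoin helpers are hypothetical names made up for illustration, and local Seq values stand in for RDDs.

```scala
object FullOuterJoinSketch {
  // Hypothetical local stand-in for cogroup: for every key seen on either
  // side, collect the values from the left and from the right.
  def cogroup[K, V1, V2](left: Seq[(K, V1)],
                         right: Seq[(K, V2)]): Map[K, (Seq[V1], Seq[V2])] = {
    val l = left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val r = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (l.keySet ++ r.keySet).iterator.map { k =>
      k -> ((l.getOrElse(k, Seq.empty[V1]), r.getOrElse(k, Seq.empty[V2])))
    }.toMap
  }

  // Full outer join: keys missing on one side are padded with None,
  // keys present on both sides yield every (left, right) combination.
  def fullOuterJoin[K, V1, V2](left: Seq[(K, V1)],
                               right: Seq[(K, V2)]): Seq[(K, (Option[V1], Option[V2]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (ls, rs)) =>
      val pairs: Seq[(Option[V1], Option[V2])] =
        if (rs.isEmpty) ls.map(v1 => (Some(v1), None))
        else if (ls.isEmpty) rs.map(v2 => (None, Some(v2)))
        else for (v1 <- ls; v2 <- rs) yield (Some(v1), Some(v2))
      pairs.map(p => k -> p)
    }
}
```

With left = Seq(1 -> "a", 2 -> "b") and right = Seq(2 -> "x", 3 -> "y"), key 1 comes out as (Some("a"), None), key 2 as (Some("b"), Some("x")), and key 3 as (None, Some("y")). The order of the result is not guaranteed, since it goes through a Map.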
> Support full outer join and multiple join in a single shuffle
> -------------------------------------------------------------
>
> Key: SPARK-546
> URL: https://issues.apache.org/jira/browse/SPARK-546
> Project: Spark
> Issue Type: Improvement
> Reporter: Reynold Xin
>
> RDD[(K,V)] now supports left/right outer join but not full outer join.
> Also it'd be nice to provide a way for users to join multiple RDDs on the
> same key in a single shuffle.