Hi Ron,

Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general, all transformations, including union, create new
RDDs; they never modify existing RDDs in place.

Here's a quick test:

scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:12

scala> val secondRDD = sc.parallelize(1 to 3)
secondRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:12

scala> firstRDD.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res2: Array[Int] = Array(1, 2, 3)

scala> val newRDD = firstRDD.union(secondRDD)
newRDD: org.apache.spark.rdd.RDD[Int] = UnionRDD[4] at union at <console>:16

scala> newRDD.collect()
res3: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3)

scala> firstRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res5: Array[Int] = Array(1, 2, 3)
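On the coalesce question raised in the quoted message below: a rough sketch of why it comes up (illustrative names `a`, `b`, `u`, `c`; assumes a live SparkContext `sc`, as in the session above). A UnionRDD simply concatenates the partitions of its parents, so the partition counts add up, and coalesce can merge them back down afterwards:

```scala
// union concatenates partitions, so partition counts add up.
val a = sc.parallelize(1 to 5, 4)   // 4 partitions
val b = sc.parallelize(1 to 3, 4)   // 4 partitions
val u = a.union(b)                  // 4 + 4 = 8 partitions
val c = u.coalesce(4)               // same data, merged into 4 partitions
println(u.partitions.size)
println(c.partitions.size)
```

One other note on "releasing it from memory": an RDD produced by parallelize or union doesn't hold its data in executor memory unless you explicitly cache it; if you do cache intermediate RDDs, calling unpersist() on them releases that storage.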


On Fri, Dec 5, 2014 at 2:27 PM, Ron Ayoub <ronalday...@live.com> wrote:

> I'm a bit confused regarding expected behavior of unions. I'm running on 8
> cores. I have an RDD that is used to collect cluster associations (cluster
> id, content id, distance) for internal clusters as well as leaf clusters
> since I'm doing hierarchical k-means and need all distances for sorting
> documents appropriately upon examination.
>
> It appears that union simply adds the items in the argument to the RDD
> instance the method is called on, rather than just returning a new RDD. If I
> want to use union this way, as more of an add/append, should I be capturing
> the return value and releasing the original from memory? I need help
> clarifying the semantics here.
>
> Also, in another related thread someone mentioned calling coalesce after
> union. Would I need to do the same on the instance RDD I'm calling union on?
>
> Perhaps a method such as append would be useful and clearer.
>
