Thanks Imran, I'll try your suggestions.
I eventually got this to run by 'checkpointing' the joined RDD (according
to Akhil's suggestion) before performing the reduceBy, and then
checkpointing it again afterward. i.e.
> val rdd2 = rdd.join(rdd, numPartitions=1000)
> .map(fp=>((fp._2._1, fp._2._2
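For anyone reading this later, here is a rough, self-contained sketch of what the checkpoint-before-and-after-the-reduce approach can look like. The quoted snippet above is cut off, so the body of the .map, the checkpoint directory, and the final reduceByKey are my assumptions, not necessarily Tom's exact code.

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path; use HDFS on a real cluster

    // rdd is assumed to be a pair RDD of (Int, Int), as Tom clarifies later in the thread
    val rdd = sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))

    // Self-join, then re-key on the pair of joined values (a guess at the truncated .map)
    val joined = rdd.join(rdd, numPartitions = 1000)
      .map { case (_, (a, b)) => ((a, b), 1) }

    // Checkpoint before the reduce so the join's long lineage is cut
    joined.checkpoint()
    joined.count() // an action is needed to actually materialise the checkpoint

    val reduced = joined.reduceByKey(_ + _)

    // Checkpoint again after the reduce, as described above
    reduced.checkpoint()
    reduced.count()

    sc.stop()
  }
}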
Hi Tom,
there are a couple of things you can do here to make this more efficient.
First, I think you can replace your self-join with a groupByKey. On your
example data set, this would give you
(1, Iterable(2,3))
(4, Iterable(3))
This reduces the amount of data that needs to be shuffled, and tha
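To make that concrete, a minimal sketch of the groupByKey version, assuming an existing SparkContext named sc and inferring the input pairs from the Iterable output above:

// Input inferred from the output shown above: (1,2), (1,3), (4,3)
val rdd = sc.parallelize(Seq((1, 2), (1, 3), (4, 3)))

val grouped = rdd.groupByKey()

// Each key's values are shuffled to a single task once, instead of being
// paired against themselves as in rdd.join(rdd)
grouped.collect().foreach(println)
// prints something like (1,CompactBuffer(2, 3)) and (4,CompactBuffer(3))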
Thanks for the reply, I'll try your suggestions.
Apologies, in my previous post I was mistaken. rdd is actually a PairRDD
of (Int, Int). I'm doing the self-join so I can count two things. First, I
can count the number of times a value appears in the data set. Second, I can
count the number of times va
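For what it's worth, the first of those counts doesn't need a join at all. A sketch, assuming rdd is the (Int, Int) PairRDD and leaving the second, cut-off count aside:

// Count how many times each value appears across the data set
val valueCounts = rdd
  .map { case (_, v) => (v, 1) }
  .reduceByKey(_ + _)
// On (1,2), (1,3), (4,3) this yields (2,1) and (3,2)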
Why are you joining the rdd with itself?
You can try these things:
- Change the StorageLevel of both rdds to MEMORY_AND_DISK_2 or
MEMORY_AND_DISK_SER, so that it doesn't need to keep everything in memory.
- Set your default serializer to Kryo (.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer"))
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out if the cluster
resources I am using are feasible for my use-case. The following is a
snippet of code that is causing an OOM exception in the executor after about
125/1000 tasks during the map stage.
> val rdd2 = rdd.join(rdd, numPartitions=1000)
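As a side note (my own illustration, not from the thread): the reason a self-join like this can run out of memory mid-stage is that a key appearing n times produces n*n output rows, so even one hot key can dominate the shuffle.

// One key with 10,000 values...
val hot = sc.parallelize((1 to 10000).map(i => (1, i)))
// ...self-joins into 10,000 * 10,000 = 100,000,000 rows for that key
val selfJoined = hot.join(hot)
println(selfJoined.count()) // 100000000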