Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network.
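
To make that concrete, here is a small plain-Python sketch (not Spark code) that models the two alternatives from the question. The `join` helper, the three sample datasets, and the `key < 100` predicate are all made up for illustration, and the number of records entering each join is only a rough stand-in for the data a shuffle would move. Real shuffle cost also depends on record sizes and on how the RDDs are partitioned.

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs, modelled on RDD.join."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

def keep(record):
    """Hypothetical filter predicate: keep only keys below 100."""
    return record[0] < 100

# Three hypothetical keyed datasets standing in for the three RDDs.
a = [(k, f"a{k}") for k in range(1000)]
b = [(k, k * 2) for k in range(1000)]
c = [(k, k % 7) for k in range(1000)]

# Alternative 1: join everything first, then filter the joined result.
ab = join(a, b)
shuffled_late = len(a) + len(b) + len(ab) + len(c)  # records fed into both joins
result_late = [r for r in join(ab, c) if keep(r)]

# Alternative 2: filter each dataset first, then join the smaller inputs.
fa, fb, fc = ([r for r in d if keep(r)] for d in (a, b, c))
fab = join(fa, fb)
shuffled_early = len(fa) + len(fb) + len(fab) + len(fc)
result_early = join(fab, fc)

# Same answer either way; only the amount of data entering the joins differs.
assert result_late == result_early
print(shuffled_late, shuffled_early)  # prints "4000 400"
```

Both alternatives produce the same 100 joined records, but filtering first feeds a tenth as many records into the joins in this toy setup.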
Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs with union, since that doesn't cause a shuffle.

On Thu, Mar 12, 2015 at 11:04 AM, shahab <shahab.mok...@gmail.com> wrote:

> Hi,
>
> Probably this question is already answered sometime in the mailing list,
> but i couldn't find it. Sorry for posting this again.
>
> I need to to join and apply filtering on three different RDDs, I just
> wonder which of the following alternatives are more efficient:
> 1- first joint all three RDDs and then do filtering on resulting joint
> RDD or
> 2- Apply filtering on each individual RDD and then join the resulting RDDs
>
>
> Or probably there is no difference due to lazy evaluation and under
> beneath Spark optimisation?
>
> best,
> /Shahab
>