
Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread Mich Talebzadeh

There are two distinct parts here: optimisation and execution. Spark does not have a Cost Based Optimizer (CBO) yet, but that does not matter for now. When we do such an operation, say an outer join between the DataFrames (s) and (t) below, we see

scala> val rs = s.join(t,s("time_id")===t("time_id"),

Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread ayan guha

I do not think so. As I understand it, Spark will still use Catalyst to plan the join. A DF always has an RDD underneath, but that does not mean any action will force a less optimal path.

On Sun, Aug 14, 2016 at 3:04 PM, mayur bhole wrote:
> Hi All,
>
> Let's say we have
>
> val df =

How Spark sql query optimisation work if we are using .rdd action ?

2016-08-13 Thread mayur bhole

Hi All,

Let's say we have

val df = bigTableA.join(bigTableB,bigTableA("A")===bigTableB("A"),"left")
val rddFromDF = df.rdd
println(rddFromDF.count)

My understanding is that Spark will convert all DataFrame operations before "rddFromDF.count" into equivalent RDD operations, as we are not
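
One way to check the scenario in the question is to print the plans Spark builds before and after calling .rdd. The following is a minimal sketch, not taken from the thread: it assumes Spark 2.x or later with a local SparkSession, and the tiny bigTableA and bigTableB tables (with hypothetical valA and valB columns) stand in for the real ones.

```scala
import org.apache.spark.sql.SparkSession

object RddFromDataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for inspecting plans.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-from-df-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the large tables in the question.
    val bigTableA = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("A", "valA")
    val bigTableB = Seq((1, "b1"), (3, "b3")).toDF("A", "valB")

    val df = bigTableA.join(bigTableB, bigTableA("A") === bigTableB("A"), "left")

    // Prints the parsed, analyzed and optimized logical plans plus the
    // physical plan that Catalyst produced for the join.
    df.explain(true)

    // .rdd is lazy: it wraps the already planned query as an RDD of Row.
    val rddFromDF = df.rdd

    // Shows the RDD lineage that sits on top of the planned query.
    println(rddFromDF.toDebugString)

    // Only this action triggers execution.
    println(rddFromDF.count)

    spark.stop()
  }
}
```

If the join strategy shown by explain also appears in the lineage under rddFromDF, the join itself was planned by Catalyst. Operations applied after .rdd are ordinary RDD transformations that the optimizer cannot see, so df.count and df.rdd.count may still run differently even though the join is planned the same way.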