Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Jacek Laskowski
Makes sense. Thanks, Michael (and welcome back from #SparkSummit!). On to exploring the space...

Jacek

On 9 Jun 2016 6:10 p.m., "Michael Armbrust" wrote:
> Look at the explain(). For a Seq we know it's just local data so avoid
> spark jobs for simple operations. In

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Michael Armbrust
Look at the explain(). For a Seq, we know it's just local data, so we avoid Spark jobs for simple operations. In contrast, an RDD is opaque to Catalyst, so we can't perform that optimization.

On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski wrote:
> Hi,
>
> I just noticed it today
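
For readers following along, here is a minimal sketch of the comparison described above, assuming a Spark 2.x build on the classpath and a local master. The object name `LocalVsRddToDF`, the column name `n`, and the sample values are made up for illustration and are not from the thread. The two physical plans should differ: a local scan for the Seq-backed DataFrame and an RDD-backed scan for the other. Exact plan node names vary between Spark versions, so none are shown here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and sample data are illustrative only.
object LocalVsRddToDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: running locally
      .appName("local-vs-rdd-toDF")
      .getOrCreate()
    import spark.implicits._         // brings toDF into scope

    // Built from a local Seq: Catalyst can see the data directly.
    val fromSeq = Seq(1, 2, 3).toDF("n")

    // Built from an RDD: the contents are opaque to Catalyst.
    val fromRdd = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("n")

    // Compare the physical plans, as suggested in the reply above.
    fromSeq.explain()
    fromRdd.explain()

    // Per the thread, a simple action on the Seq-backed DataFrame should
    // not submit a Spark job, while the RDD-backed one should.
    // The Jobs tab of the web UI is one place to check this.
    fromSeq.show()
    fromRdd.show()

    spark.stop()
  }
}
```

The same statements (minus the session setup) can be pasted into spark-shell, where `spark` and `sc` are already defined.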

Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-08 Thread Jacek Laskowski
Hi, I noticed today, while toying with Spark 2.0.0 (today's build), that Seq(...).toDF does **not** submit a Spark job while sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have been thinking about the reason for the behaviour. My explanation was that Datasets are just a