Look at the explain() output.  For a Seq we know it's just local data, so
we can avoid Spark jobs for simple operations.  In contrast, an RDD is
opaque to Catalyst, so we can't perform that optimization.
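A quick way to see the difference in spark-shell (a rough sketch; the exact
physical plan node names differ across Spark versions, and the listener-based
job counter below is only for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count submitted jobs so we can observe which path actually schedules work.
@volatile var jobsStarted = 0
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobsStarted += 1
})

import spark.implicits._

// Local data: Catalyst plans this as a LocalTableScan, and simple actions
// are answered directly on the driver without submitting a job.
val localDF = Seq(1, 2, 3).toDF("n")
localDF.explain()
// == Physical Plan ==
// LocalTableScan [n#...]

localDF.show()
println(s"jobs after local show(): $jobsStarted")       // expect 0

// RDD-backed data: the RDD's contents are opaque to Catalyst, so the plan
// has to scan the existing RDD and actions must be scheduled as jobs.
val rddDF = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("n")
rddDF.explain()
// == Physical Plan ==
// *SerializeFromObject [...]
// +- Scan ExistingRDD[...]

rddDF.show()
println(s"jobs after RDD-backed show(): $jobsStarted")  // expect >= 1

For links to the code: the relevant pieces are roughly LocalRelation (the
logical node) and LocalTableScanExec (the physical node, which overrides
executeCollect/executeTake to return rows without running a job), versus the
planning of RDD-backed data as a scan over the existing RDD.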

On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I just noticed today, while toying with Spark 2.0.0 (today's build),
> that Seq(...).toDF does **not** submit a Spark job, while
> sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have
> been thinking about the reason for this behaviour.
>
> My explanation was that Datasets are just a "view" layer atop the data,
> and when that data is already local/in memory there's no need to submit
> a job to...well...compute the data.
>
> I'd appreciate a more in-depth answer, perhaps with links to the code.
> Thanks!
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>