Look at the explain() output.  For a Seq we know it's just local data, so
we can avoid Spark jobs for simple operations.  In contrast, an RDD is
opaque to Catalyst, so we can't perform that optimization.
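A quick way to see the difference in spark-shell (a rough sketch; the exact
physical plan node names differ across Spark versions, and the listener-based
job counter below is only for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count submitted jobs so we can observe which path actually schedules work.
@volatile var jobsStarted = 0
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobsStarted += 1
})

import spark.implicits._

// Local data: Catalyst plans this as a LocalTableScan, and simple actions
// are answered directly on the driver without submitting a job.
val localDF = Seq(1, 2, 3).toDF("n")
localDF.explain()
// == Physical Plan ==
// LocalTableScan [n#...]

localDF.show()
println(s"jobs after local show(): $jobsStarted")       // expect 0

// RDD-backed data: the RDD's contents are opaque to Catalyst, so the plan
// has to scan the existing RDD and actions must be scheduled as jobs.
val rddDF = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("n")
rddDF.explain()
// == Physical Plan ==
// *SerializeFromObject [...]
// +- Scan ExistingRDD[...]

rddDF.show()
println(s"jobs after RDD-backed show(): $jobsStarted")  // expect >= 1

For links to the code: the relevant pieces are roughly LocalRelation (the
logical node) and LocalTableScanExec (the physical node, which overrides
executeCollect/executeTake to return rows without running a job), versus the
planning of RDD-backed data as a scan over the existing RDD.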

On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I just noticed today, while toying with Spark 2.0.0 (today's build),
> that Seq(...).toDF does **not** submit a Spark job, while
> sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have
> been thinking about the reason for this behaviour.
>
> My explanation was that Datasets are just a "view" layer atop the data,
> and when that data is already local/in memory there's no need to submit
> a job to...well...compute the data.
>
> I'd appreciate a more in-depth answer, perhaps with links to the code.
> Thanks!
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>