[
https://issues.apache.org/jira/browse/SPARK-19091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803105#comment-15803105
]
Josh Rosen commented on SPARK-19091:
------------------------------------
This is a fairly easy change, but it slightly affects users who rely on the
degree of parallelism set by sc.parallelize(). That makes it less clear-cut as
an optimization. I'll leave this JIRA here as documentation of the odd
performance variation so that users can choose the appropriate method for
their use case.
> createDataset(sc.parallelize(x: Seq)) should be equivalent to
> createDataset(x: Seq)
> -----------------------------------------------------------------------------------
>
> Key: SPARK-19091
> URL: https://issues.apache.org/jira/browse/SPARK-19091
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Josh Rosen
>
> It turns out that spark.createDataset(sc.parallelize(x: Seq)) and
> spark.createDataset(x: Seq) produce different plans, where the former is
> much less efficient due to a lack of accurate size estimation. We should
> modify SparkSession to special-case the situation where createDataset is
> called on a ParallelCollectionRDD in order to remove this source of
> performance variation between the two plans.
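
To see the difference described above, the two call paths can be compared side by side. The following is a minimal sketch, not code from the issue: the object name CreateDatasetPlans, the helper build, the local[2] master, and the sample range are all assumptions made for illustration. It builds the same data both ways, prints the extended plans, and prints the partition counts, since the comment notes that some users rely on the parallelism chosen by sc.parallelize().

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical helper object, used only for this illustration.
object CreateDatasetPlans {

  // Builds the same data through both code paths and returns (viaRdd, viaSeq).
  def build(spark: SparkSession, x: Seq[Int]): (Dataset[Int], Dataset[Int]) = {
    import spark.implicits._ // provides the Encoder[Int] that createDataset needs
    val viaRdd = spark.createDataset(spark.sparkContext.parallelize(x))
    val viaSeq = spark.createDataset(x)
    (viaRdd, viaSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect the plans.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("spark-19091-sketch")
      .getOrCreate()

    val (viaRdd, viaSeq) = build(spark, 1 to 1000)

    // Print the parsed, analyzed, optimized, and physical plans for each path.
    viaRdd.explain(true)
    viaSeq.explain(true)

    // The two paths may also differ in how many partitions they produce.
    println(s"partitions via RDD: ${viaRdd.rdd.getNumPartitions}")
    println(s"partitions via Seq: ${viaSeq.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The exact plan node names vary by Spark version, so the printed plans are the thing to compare rather than any particular operator name. Both datasets hold the same rows; only the plan, and possibly the partitioning, should differ.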