[ 
https://issues.apache.org/jira/browse/SPARK-19091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803105#comment-15803105
 ] 

Josh Rosen commented on SPARK-19091:
------------------------------------

This is a fairly easy change, but it does slightly affect users who rely on 
the degree of parallelism set by sc.parallelize(). So this may not be as 
clear-cut an optimization as it first appeared. I'll leave this JIRA here as 
documentation of the odd performance variation so that users can choose the 
appropriate method for their use case.

> createDataset(sc.parallelize(x: Seq)) should be equivalent to 
> createDataset(x: Seq)
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19091
>                 URL: https://issues.apache.org/jira/browse/SPARK-19091
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> It turns out that spark.createDataset(sc.parallelize(x: Seq)) and 
> spark.createDataset(x: Seq) produce different plans, where the former is 
> much less efficient due to a lack of accurate size estimation. We should 
> modify SparkSession to special-case the situation where createDataset is 
> called on a ParallelCollectionRDD in order to remove this source of 
> performance variation between the two plans.
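
To see the difference described above on a given Spark version, the two calls 
can be compared side by side. The following is only a minimal sketch: the 
local master setting, the app name, the object name, and the sample data are 
assumptions chosen for illustration and do not come from the issue. It prints 
the plans and the resulting partition counts without asserting what they 
should be, since both may vary between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object CreateDatasetPlans {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (master and app name are assumptions).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("SPARK-19091-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val x: Seq[Int] = 1 to 1000

    // Path 1: wrap the local Seq in an RDD first, then build the Dataset.
    val viaRdd = spark.createDataset(spark.sparkContext.parallelize(x))

    // Path 2: hand the local Seq to createDataset directly.
    val viaSeq = spark.createDataset(x)

    // Print the parsed, analyzed, optimized, and physical plans for each path.
    viaRdd.explain(true)
    viaSeq.explain(true)

    // The comment above notes that parallelism can differ between the two
    // paths; printing the partition counts makes that visible.
    println(s"partitions via RDD: ${viaRdd.rdd.getNumPartitions}")
    println(s"partitions via Seq: ${viaSeq.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

If the partition count matters for a job, the first form keeps it under the 
caller's control through the numSlices argument of sc.parallelize(); otherwise 
the second form avoids the intermediate RDD.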



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

