[ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986722#comment-16986722
 ] 

sam commented on SPARK-30101:
-----------------------------

[~cloud_fan] [~kabhwan] Well this is at least a documentation error since 
`spark.sql.shuffle.partitions` isn't even in the configuration documentation 
https://spark.apache.org/docs/latest/configuration.html

Also "Returns a new Dataset that contains only the unique rows from this 
Dataset. This is an alias for dropDuplicates." in 
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
 should be updated to say "... and repartitions this with 
`spark.sql.shuffle.partitions` partitions".

Do you agree we need a feature ticket to add `numPartitions` as an optional 
param to `distinct` since most shuffle operations have this?

> Dataset distinct does not respect spark.default.parallelism
> -----------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>       .builder().appName("foo").master("local")
>       .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to