[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Florentino Sainz updated SPARK-29265:
-------------------------------------
Description:

Hi, I had this problem in "real" environments and also made a self-contained test ([^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
{code}
In the test I can see how all elements of my DF are being collected in a single task. An unbounded, unordered Window + collect_list seems to be collecting the WHOLE dataframe in a single executor/task. groupBy + collect_list behaves as expected (collect_list runs for each group independently).

Full code showing the error (note how mapPartitions shows 99 rows in one partition) is attached in Test.scala; the sbt project (src and build.sbt) is attached in TestSpark.zip.

was:
Hi, I had this problem in "real" environments and also made a self-contained test ([^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
{code}
In the test I can see how all elements of my DF end up in a single partition (a side effect of the global orderBy).

Full code showing the error (note how mapPartitions shows 99 rows in one partition) is attached in Test.scala; the sbt project (src and build.sbt) is attached in TestSpark.zip.

> Window+collect_list causing single-task operation
> -------------------------------------------------
>
>                 Key: SPARK-29265
>                 URL: https://issues.apache.org/jira/browse/SPARK-29265
>             Project: Spark
>          Issue Type: Bug
>      Components: Spark Core
>    Affects Versions: 2.4.0
>       Environment: Any
>          Reporter: Florentino Sainz
>          Priority: Minor
>
> Hi,
>
> I had this problem in "real" environments and also made a self-contained test
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word")
> val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
> {code}
>
> In the test I can see how all elements of my DF are being collected in a single task.
> An unbounded, unordered Window + collect_list seems to be collecting the WHOLE dataframe in a single executor/task.
> groupBy + collect_list behaves as expected (collect_list runs for each group independently).
>
> Full code showing the error (note how mapPartitions shows 99 rows in one partition) is attached in Test.scala; the sbt project (src and build.sbt) is attached in TestSpark.zip.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
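For reference, the groupBy alternative mentioned in the report could be sketched as follows. This is a hedged sketch, not the reporter's attached code: it assumes a DataFrame named `filtrador` with columns `word` and `number`, matching the identifiers used in the snippet above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

object CollectListWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-list-workaround")
      .master("local[*]") // local mode for a self-contained run
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the reporter's `filtrador` DataFrame.
    val filtrador = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("word", "number")

    // groupBy + collect_list aggregates per group, so the work is
    // distributed across tasks instead of being funneled into one.
    val grouped = filtrador
      .groupBy($"word")
      .agg(collect_list($"number").as("avg_Time"))

    // If the per-row layout produced by the Window version is needed,
    // join the aggregated list back onto the original rows.
    val filt2 = filtrador.join(grouped, Seq("word"))
    filt2.show()

    spark.stop()
  }
}
```

Whether this preserves the exact semantics depends on the frame of the original Window; with the default frame and orderBy on the partition column itself, every row in a partition is a peer, so the per-group list should match.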