[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-------------------------------------
    Description: 
Hi,

 

I hit this problem in "real" environments and also reproduced it with a self-contained test ([^Test.scala], attached).

Having this Window definition:
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time",
  collect_list($"number").over(myWindow))
{code}
  

In the test I can see that all the elements of my DF end up being collected in a single task.


An unbounded + unordered Window + collect_list seems to collect the ENTIRE dataframe in a single executor/task.

groupBy + collect_list seems to behave as expected (collect_list is computed for each group independently).
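
A minimal sketch of the comparison, assuming a SparkSession and a small made-up DataFrame with the same "word" and "number" columns as above (the sample data and any names other than "filtrador", "myWindow" and "filt2" are illustrative assumptions, not taken from Test.scala):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().master("local[4]").appName("SPARK-29265-sketch").getOrCreate()
import spark.implicits._

// Made-up stand-in for the "filtrador" DataFrame used in the report.
val filtrador = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)).toDF("word", "number")

// Window variant from the report: partition and order by the same column.
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))

// groupBy variant, which behaves as expected: one collected list per group.
val grouped = filtrador.groupBy($"word").agg(collect_list($"number").as("avg_Time"))

filt2.show()
grouped.show()
{code}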
 

Full code showing the error (note how mapPartitions reports 99 rows in one partition) is attached as Test.scala; the sbt project (src and build.sbt) is also attached as TestSpark.zip.
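
For reference, this is a sketch of the kind of per-partition check the mapPartitions call in Test.scala performs (the helper name "showPartitionSizes" is made up here; Test.scala itself is not reproduced):
{code:scala}
import org.apache.spark.sql.DataFrame

// Print how many rows each partition of a DataFrame holds.
def showPartitionSizes(df: DataFrame): Unit = {
  df.rdd
    .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }
}

// In the reported scenario, the Window+collect_list result shows 99 rows in one partition.
showPartitionSizes(filt2)
{code}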

  was:
Hi,

 

I hit this problem in "real" environments and also reproduced it with a self-contained test ([^Test.scala], attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time",
  collect_list($"number").over(myWindow))
{code}
  


 

In the test I can see that all the elements of my DF end up in a single partition (a side-effect of the global orderBy).

 

Full code showing the error (note how mapPartitions reports 99 rows in one partition) is attached as Test.scala; the sbt project (src and build.sbt) is also attached as TestSpark.zip.


> Window+collect_list causing single-task operation
> -------------------------------------------------
>
>                 Key: SPARK-29265
>                 URL: https://issues.apache.org/jira/browse/SPARK-29265
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>         Environment: Any
>            Reporter: Florentino Sainz
>            Priority: Minor
>
> Hi,
>  
> I hit this problem in "real" environments and also reproduced it with a self-contained test ([^Test.scala], attached).
> Having this Window definition:
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.collect_list
>
> val myWindow = Window.partitionBy($"word").orderBy("word")
> val filt2 = filtrador.withColumn("avg_Time",
>   collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see that all the elements of my DF end up being collected in a single task.
> An unbounded + unordered Window + collect_list seems to collect the ENTIRE dataframe in a single executor/task.
> groupBy + collect_list seems to behave as expected (collect_list is computed for each group independently).
>  
> Full code showing the error (note how mapPartitions reports 99 rows in one partition) is attached as Test.scala; the sbt project (src and build.sbt) is also attached as TestSpark.zip.


