[ https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Florentino Sainz updated SPARK-29265:
-------------------------------------
Description:
Hi,
I ran into this problem in "real" environments and also reproduced it in a
self-contained test ([^Test.scala] attached).
Given this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", avg($"number").over(myWindow))
{code}
As a user, I would expect one of the following:
1- An error or warning, because the window sorts on one of its own partitionBy
columns.
2- A mostly useless operation that just orders the rows inside each window
partition, without much impact on performance.
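For reference, a minimal sketch of why the orderBy in this window looks
redundant. It assumes a local SparkSession and a small made-up DataFrame
standing in for "filtrador"; the object name, helper names and data are
hypothetical, and only the "word" and "number" columns come from the report.
According to the Spark Window documentation, a window with an ordering defaults
to a RANGE frame from unbounded preceding to the current row, while a window
without an ordering defaults to the whole partition. Every row in a partition
has the same "word", so all rows are peers and both windows should produce the
same average:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

object RedundantOrderBy {
  // Hypothetical stand-in for `filtrador`: 99 rows spread over 10 words.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 99).map(i => (s"word${i % 10}", i)).toDF("word", "number")
  }

  // Adds the average computed with and without the orderBy on the partition key.
  def withBothAverages(df: DataFrame): DataFrame = {
    val ordered = Window.partitionBy(col("word")).orderBy("word")
    val unordered = Window.partitionBy(col("word"))
    df.withColumn("avg_ordered", avg(col("number")).over(ordered))
      .withColumn("avg_unordered", avg(col("number")).over(unordered))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("redundant-order-by")
      .getOrCreate()
    val result = withBothAverages(sampleData(spark))
    // Rows where the two averages differ; expected to be empty under the
    // assumptions stated above.
    result.filter(col("avg_ordered") =!= col("avg_unordered")).show()
    spark.stop()
  }
}
```

If the two columns agree, dropping the orderBy from the window is a candidate
workaround for this particular aggregation. It would not be equivalent for
ranking functions or for windows ordered by a column other than the partition
key.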
What I currently see:
*When I use "myWindow" on any DataFrame, the Window.orderBy somehow performs a
global orderBy of the whole DataFrame, similar to
dataframe.orderBy("word").*
*In my real environment, the program either did not finish in time or crashed,
because a global orderBy sends all the data to a single executor.*
In the test I can see that all elements of my DF end up in a single partition
(a side effect of the global orderBy).
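One way to check this independently of the attached test is to print the
physical plan and count the rows per partition after the window is applied.
The sketch below is not the code from Test.scala: the object name, helper
names and sample data are hypothetical, and it assumes a local SparkSession.
It relies only on Dataset.explain and the spark_partition_id function:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, spark_partition_id}

object WindowPartitionCheck {
  // Hypothetical stand-in for `filtrador`: 99 rows spread over 10 words.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 99).map(i => (s"word${i % 10}", i)).toDF("word", "number")
  }

  // Applies the window definition from the report.
  def withWindowAvg(df: DataFrame): DataFrame = {
    val myWindow = Window.partitionBy(col("word")).orderBy("word")
    df.withColumn("avg_Time", avg(col("number")).over(myWindow))
  }

  // Number of rows held by each physical partition of `df`.
  def rowsPerPartition(df: DataFrame): Map[Int, Long] =
    df.groupBy(spark_partition_id().as("partition_id"))
      .count()
      .collect()
      .map(row => row.getInt(0) -> row.getLong(1))
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("window-partition-check")
      .getOrCreate()
    val filt2 = withWindowAvg(sampleData(spark))
    // Inspect the Exchange and Sort nodes that the window introduces.
    filt2.explain()
    // A single entry holding every row would match the behaviour described above.
    rowsPerPartition(filt2).toSeq.sortBy(_._1).foreach {
      case (id, n) => println(s"partition $id: $n rows")
    }
    spark.stop()
  }
}
```

How rows spread across partitions also depends on the number of distinct
"word" values and on spark.sql.shuffle.partitions, so the result should be
read together with the plan printed by explain.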
The full code showing the error is attached in Test.scala (note how the
mapPartitions output shows 99 rows in one partition); the sbt project (src and
build.sbt) is attached too.
> Window orderBy causing full-DF orderBy
> ---------------------------------------
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.3, 2.4.4
> Environment: Any
> Reporter: Florentino Sainz
> Priority: Minor
> Attachments: Test.scala, TestSpark.zip