[jira] [Comment Edited] (SPARK-11481) orderBy with multiple columns in WindowSpec does not work properly

Davies Liu (JIRA) Mon, 09 Nov 2015 12:14:44 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997271#comment-14997271
 ]


Davies Liu edited comment on SPARK-11481 at 11/9/15 8:13 PM:
-------------------------------------------------------------

[~jamartinh] Could you provide more details about this? Did it fail, or did it 
generate wrong results?

I tested rank() with multiple columns in the orderBy of a WindowSpec:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# a is the partition key; b and c are the ordering keys; d is unused filler.
df = sqlContext.range(200).selectExpr(
    "id % 10 as a", "id % 30 as b", "rand() as c", "rand() as d")
# Rank rows within each partition of a, ordered by b and then by c.
w = Window.partitionBy("a").orderBy("b", "c")
df.withColumn('e', rank().over(w)).orderBy("a", "b", "c").show()
{code}

{code}
+---+---+--------------------+-------------------+---+
|  a|  b|                   c|                  d|  e|
+---+---+--------------------+-------------------+---+
|  0|  0|  0.2194700128055158| 0.6702826320240122|  1|
|  0|  0| 0.29311255730305474|0.17981084403674907|  2|
|  0|  0|  0.5020715280611638| 0.7784768711132941|  3|
|  0|  0|  0.6204718751835817|  0.658288203610106|  4|
|  0|  0|  0.6347337400705437| 0.2252903559750168|  5|
|  0|  0|  0.6821064252275095|0.17952666819892493|  6|
|  0|  0|  0.8402819764752624|0.28817032040721025|  7|
|  0| 10|0.049978271002973806|0.49749495851508385|  8|
|  0| 10|0.055777048779296456| 0.4016624707648714|  9|
|  0| 10|  0.2992817627985134| 0.5788639047928409| 10|
|  0| 10| 0.39818210393355635|0.29767369914222863| 11|
|  0| 10| 0.46253659167881533| 0.9075613037816466| 12|
|  0| 10|  0.5438188620473111|0.27446749417467664| 13|
|  0| 10|  0.9257608676148266| 0.3653579358201806| 14|
|  0| 20|0.022895438467160578| 0.5552700300213504| 15|
|  0| 20| 0.02611614998960987| 0.2130967533917808| 16|
|  0| 20|  0.7634887765995678| 0.2451834678629934| 17|
|  0| 20|  0.7746661338423646| 0.7678374501860449| 18|
|  0| 20|  0.8038587340599448| 0.3852190054888385| 19|
|  0| 20|  0.8409986361378347|0.31996442912312584| 20|
+---+---+--------------------+-------------------+---+
only showing top 20 rows
{code}
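
For checking output like the table above by hand, here is a minimal plain-Python sketch of what rank() over partitionBy("a").orderBy("b", "c") is expected to compute. It does not use Spark at all; the helper name rank_over and the sample rows are hypothetical and serve only as a reference to compare against.

```python
from itertools import groupby

def rank_over(rows, partition_key, order_keys):
    """Mimic SQL RANK(): ties share a rank, and the next rank skips ahead."""
    def part(row):
        return row[partition_key]

    def order(row):
        # Compare all ordering columns together, left to right.
        return tuple(row[k] for k in order_keys)

    out = []
    # Sort by partition first, then by the full ordering tuple.
    ordered = sorted(rows, key=lambda r: (part(r), order(r)))
    for _, group in groupby(ordered, key=part):
        prev, rank = None, 0
        for position, row in enumerate(group, start=1):
            key = order(row)
            if key != prev:
                rank, prev = position, key
            out.append(dict(row, rank=rank))
    return out

# Hypothetical sample data, not taken from the run above.
rows = [
    {"a": 0, "b": 10, "c": 0.3},
    {"a": 0, "b": 0, "c": 0.5},
    {"a": 0, "b": 0, "c": 0.2},
    {"a": 1, "b": 0, "c": 0.9},
    {"a": 0, "b": 10, "c": 0.3},
]
ranked = rank_over(rows, "a", ["b", "c"])
for row in ranked:
    print(row)
```

If the second ordering column were ignored, rows with the same b would all be ties and share one rank, so a table where e increases with c inside each b group indicates that both columns are being used.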


> orderBy with multiple columns in WindowSpec does not work properly
> ------------------------------------------------------------------
>
>                 Key: SPARK-11481
>                 URL: https://issues.apache.org/jira/browse/SPARK-11481
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.1
>         Environment: All
>            Reporter: Jose Antonio
>              Labels: DataFrame, sparkSQL
>
> When using multiple columns in the orderBy of a WindowSpec, the ordering seems 
> to be applied only to the first column.
> A possible workaround is to sort the DataFrame first and then apply the 
> window spec over the sorted DataFrame,
> e.g. 
> THIS DOES NOT WORK:
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> THIS WORKS WELL:
> df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> Also, can anybody confirm that this is a valid workaround?
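
To make the expected result of the reported query concrete, here is a small plain-Python sketch of a running sum per partition with rows ordered by several columns, which is what a frame from the start of the partition to the current row should produce. The column names come from the description above; the helper running_sum and the sample rows are hypothetical, and no Spark is involved.

```python
from collections import defaultdict

def running_sum(rows, partition_key, order_keys, value_key, out_key):
    """Cumulative sum within each partition, ordered by all of order_keys."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    out = []
    for key in sorted(partitions):
        # Order by the full tuple of ordering columns, not just the first.
        ordered = sorted(partitions[key],
                         key=lambda r: tuple(r[k] for k in order_keys))
        total = 0
        for row in ordered:
            total += row[value_key]
            out.append(dict(row, **{out_key: total}))
    return out

# Hypothetical sample data for illustration only.
rows = [
    {"user_unique_id": "u1", "creation_date": "2015-11-02", "mib_id": 1, "day": 1, "group_counter": 1},
    {"user_unique_id": "u1", "creation_date": "2015-11-01", "mib_id": 2, "day": 1, "group_counter": 1},
    {"user_unique_id": "u1", "creation_date": "2015-11-01", "mib_id": 1, "day": 2, "group_counter": 0},
    {"user_unique_id": "u2", "creation_date": "2015-11-01", "mib_id": 1, "day": 1, "group_counter": 1},
]
result = running_sum(rows, "user_unique_id",
                     ["creation_date", "mib_id", "day"],
                     "group_counter", "user_version")
for row in result:
    print(row["user_unique_id"], row["creation_date"], row["mib_id"],
          row["day"], row["user_version"])
```

Note that with a rows-based frame, rows that tie on every ordering column have no defined relative order, so their cumulative values can differ between runs. That is one thing worth ruling out before concluding that the later ordering columns are ignored.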


