[
https://issues.apache.org/jira/browse/SPARK-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997271#comment-14997271
]
Davies Liu edited comment on SPARK-11481 at 11/9/15 8:13 PM:
-------------------------------------------------------------
[~jamartinh] Could you provide more details about this? Did it fail, or did it
generate wrong results?
I tested rank() with multiple columns in the orderBy of a WindowSpec:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
# sqlContext is the SQLContext provided by the pyspark shell
df = sqlContext.range(200).selectExpr("id % 10 as a", "id % 30 as b",
                                      "rand() as c", "rand() as d")
w = Window.partitionBy("a").orderBy("b", "c")
df.withColumn('e', rank().over(w)).orderBy("a", "b", "c").show()
{code}
{code}
+---+---+--------------------+-------------------+---+
| a| b| c| d| e|
+---+---+--------------------+-------------------+---+
| 0| 0| 0.2194700128055158| 0.6702826320240122| 1|
| 0| 0| 0.29311255730305474|0.17981084403674907| 2|
| 0| 0| 0.5020715280611638| 0.7784768711132941| 3|
| 0| 0| 0.6204718751835817| 0.658288203610106| 4|
| 0| 0| 0.6347337400705437| 0.2252903559750168| 5|
| 0| 0| 0.6821064252275095|0.17952666819892493| 6|
| 0| 0| 0.8402819764752624|0.28817032040721025| 7|
| 0| 10|0.049978271002973806|0.49749495851508385| 8|
| 0| 10|0.055777048779296456| 0.4016624707648714| 9|
| 0| 10| 0.2992817627985134| 0.5788639047928409| 10|
| 0| 10| 0.39818210393355635|0.29767369914222863| 11|
| 0| 10| 0.46253659167881533| 0.9075613037816466| 12|
| 0| 10| 0.5438188620473111|0.27446749417467664| 13|
| 0| 10| 0.9257608676148266| 0.3653579358201806| 14|
| 0| 20|0.022895438467160578| 0.5552700300213504| 15|
| 0| 20| 0.02611614998960987| 0.2130967533917808| 16|
| 0| 20| 0.7634887765995678| 0.2451834678629934| 17|
| 0| 20| 0.7746661338423646| 0.7678374501860449| 18|
| 0| 20| 0.8038587340599448| 0.3852190054888385| 19|
| 0| 20| 0.8409986361378347|0.31996442912312584| 20|
+---+---+--------------------+-------------------+---+
only showing top 20 rows
{code}
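One way to double-check output like the above without Spark is to compute the expected ranks in plain Python. The following is only a minimal sketch: the helper name rank_over and the sample rows are hypothetical, and it assumes the standard SQL meaning of rank() (tied rows share a rank, and a gap follows the tie).

```python
from itertools import groupby


def rank_over(rows, partition_key, order_key):
    """Append a SQL-style rank() to each row, computed per partition.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    out = []
    # Sort by partition first, then by the full multi-column order key.
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, part in groupby(ordered, key=partition_key):
        rank, prev = 0, object()  # sentinel never equals a real key
        for position, row in enumerate(part, start=1):
            key = order_key(row)
            if key != prev:
                # A new order key takes its 1-based position as rank;
                # tied rows keep the rank of the first row in the tie.
                rank, prev = position, key
            out.append(row + (rank,))
    return out


# Made-up rows shaped like (a, b, c) from the test above.
rows = [(0, 10, 0.3), (0, 0, 0.9), (0, 0, 0.2),
        (1, 1, 0.5), (0, 10, 0.1), (0, 0, 0.2)]
ranked = rank_over(rows,
                   partition_key=lambda r: r[0],
                   order_key=lambda r: (r[1], r[2]))
for row in ranked:
    print(row)
```

If the second order column were ignored, rows with the same b would all share one rank, which makes a difference easy to spot when comparing against the DataFrame output.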
> orderBy with multiple columns in WindowSpec does not work properly
> ------------------------------------------------------------------
>
> Key: SPARK-11481
> URL: https://issues.apache.org/jira/browse/SPARK-11481
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.5.1
> Environment: All
> Reporter: Jose Antonio
> Labels: DataFrame, sparkSQL
>
> When using multiple columns in the orderBy of a WindowSpec, the ordering seems
> to apply only to the first column.
> A possible workaround is to sort the DataFrame beforehand and then apply the
> window spec over the sorted DataFrame,
> e.g.
> THIS DOES NOT WORK:
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date',
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version',
> func.sum(df.group_counter).over(window_sum))
> THIS WORKS WELL:
> df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date',
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version',
> func.sum(df.group_counter).over(window_sum))
> Also, can anybody confirm that this is a true workaround?
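For comparison, here is a plain-Python sketch of what a running sum over that window would be expected to produce, assuming the frame means "all rows from the start of the partition up to the current row". The helper name running_sum and the sample rows are hypothetical; only the column names come from the report.

```python
from itertools import groupby


def running_sum(rows, partition_cols, order_cols, value_col, out_col):
    """Cumulative sum per partition, ordered by all of order_cols.

    Hypothetical reference helper for illustration; not a PySpark API.
    """
    def pkey(r):
        return tuple(r[c] for c in partition_cols)

    def okey(r):
        return tuple(r[c] for c in order_cols)

    result = []
    ordered = sorted(rows, key=lambda r: (pkey(r), okey(r)))
    for _, part in groupby(ordered, key=pkey):
        total = 0
        for row in part:
            total += row[value_col]
            # Copy the row so the input list is left untouched.
            result.append({**row, out_col: total})
    return result


def make_row(user, date, mib, day, counter):
    return {"user_unique_id": user, "creation_date": date,
            "mib_id": mib, "day": day, "group_counter": counter}


# Made-up sample data using the column names from the report.
rows = [
    make_row("u1", "2015-11-02", 1, 1, 1),
    make_row("u1", "2015-11-01", 2, 1, 1),
    make_row("u1", "2015-11-01", 1, 2, 0),
    make_row("u2", "2015-11-01", 1, 1, 1),
    make_row("u1", "2015-11-01", 1, 1, 1),
]
result = running_sum(rows, ["user_unique_id"],
                     ["creation_date", "mib_id", "day"],
                     "group_counter", "user_version")
versions = [r["user_version"] for r in result]
print(versions)
```

Note that when the order columns do not uniquely order the rows within a partition, a row-based frame can assign different sums to the tied rows from run to run, so a comparison like this is only meaningful on data without such ties.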