[ https://issues.apache.org/jira/browse/SPARK-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997271#comment-14997271 ]
Davies Liu edited comment on SPARK-11481 at 11/9/15 8:13 PM:
-------------------------------------------------------------

[~jamartinh] Could you provide more details about this? Did it fail, or did it generate wrong results?

I tested rank() with multiple columns in the orderBy of a WindowSpec:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, rank

df = sqlContext.range(200).selectExpr("id % 10 as a", "id % 30 as b", "rand() as c", "rand() as d")
w = Window.partitionBy("a").orderBy("b", "c")
df.withColumn('e', rank().over(w)).orderBy("a", "b", "c").show()
{code}
{code}
+---+---+--------------------+-------------------+---+
|  a|  b|                   c|                  d|  e|
+---+---+--------------------+-------------------+---+
|  0|  0|  0.2194700128055158| 0.6702826320240122|  1|
|  0|  0| 0.29311255730305474|0.17981084403674907|  2|
|  0|  0|  0.5020715280611638| 0.7784768711132941|  3|
|  0|  0|  0.6204718751835817|  0.658288203610106|  4|
|  0|  0|  0.6347337400705437| 0.2252903559750168|  5|
|  0|  0|  0.6821064252275095|0.17952666819892493|  6|
|  0|  0|  0.8402819764752624|0.28817032040721025|  7|
|  0| 10|0.049978271002973806|0.49749495851508385|  8|
|  0| 10|0.055777048779296456| 0.4016624707648714|  9|
|  0| 10|  0.2992817627985134| 0.5788639047928409| 10|
|  0| 10| 0.39818210393355635|0.29767369914222863| 11|
|  0| 10| 0.46253659167881533| 0.9075613037816466| 12|
|  0| 10|  0.5438188620473111|0.27446749417467664| 13|
|  0| 10|  0.9257608676148266| 0.3653579358201806| 14|
|  0| 20|0.022895438467160578| 0.5552700300213504| 15|
|  0| 20| 0.02611614998960987| 0.2130967533917808| 16|
|  0| 20|  0.7634887765995678| 0.2451834678629934| 17|
|  0| 20|  0.7746661338423646| 0.7678374501860449| 18|
|  0| 20|  0.8038587340599448| 0.3852190054888385| 19|
|  0| 20|  0.8409986361378347|0.31996442912312584| 20|
+---+---+--------------------+-------------------+---+
only showing top 20 rows
{code}

was (Author: davies):
[~jamartinh] Could you provide more details about this? Did it fail, or did it generate wrong results?
I tested rank() with multiple columns in the orderBy of a WindowSpec:
{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, rank
import sys

df = sqlContext.range(200).selectExpr("id % 10 as a", "id % 30 as b", "rand() as c", "rand() as d")
w = Window.partitionBy("a").orderBy("b", "c")
df.withColumn('e', rank().over(w)).orderBy("a", "b", "c").show()
{code}
{code}
+---+---+--------------------+-------------------+---+
|  a|  b|                   c|                  d|  e|
+---+---+--------------------+-------------------+---+
|  0|  0|  0.2194700128055158| 0.6702826320240122|  1|
|  0|  0| 0.29311255730305474|0.17981084403674907|  2|
|  0|  0|  0.5020715280611638| 0.7784768711132941|  3|
|  0|  0|  0.6204718751835817|  0.658288203610106|  4|
|  0|  0|  0.6347337400705437| 0.2252903559750168|  5|
|  0|  0|  0.6821064252275095|0.17952666819892493|  6|
|  0|  0|  0.8402819764752624|0.28817032040721025|  7|
|  0| 10|0.049978271002973806|0.49749495851508385|  8|
|  0| 10|0.055777048779296456| 0.4016624707648714|  9|
|  0| 10|  0.2992817627985134| 0.5788639047928409| 10|
|  0| 10| 0.39818210393355635|0.29767369914222863| 11|
|  0| 10| 0.46253659167881533| 0.9075613037816466| 12|
|  0| 10|  0.5438188620473111|0.27446749417467664| 13|
|  0| 10|  0.9257608676148266| 0.3653579358201806| 14|
|  0| 20|0.022895438467160578| 0.5552700300213504| 15|
|  0| 20| 0.02611614998960987| 0.2130967533917808| 16|
|  0| 20|  0.7634887765995678| 0.2451834678629934| 17|
|  0| 20|  0.7746661338423646| 0.7678374501860449| 18|
|  0| 20|  0.8038587340599448| 0.3852190054888385| 19|
|  0| 20|  0.8409986361378347|0.31996442912312584| 20|
+---+---+--------------------+-------------------+---+
only showing top 20 rows
{code}

> orderBy with multiple columns in WindowSpec does not work properly
> ------------------------------------------------------------------
>
>                 Key: SPARK-11481
>                 URL: https://issues.apache.org/jira/browse/SPARK-11481
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.1
>         Environment: All
>            Reporter: Jose Antonio
>              Labels: DataFrame, sparkSQL
>
> When using multiple columns in the orderBy of a WindowSpec, the ordering seems
> to be applied only to the first column.
> A possible workaround is to sort the DataFrame beforehand and then apply the
> window spec over the sorted DataFrame, e.g.
> THIS DOES NOT WORK:
> # imports assumed by both snippets (not part of the original report)
> import sys
> from pyspark.sql import functions as func
> from pyspark.sql.window import Window
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date',
>     'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version',
>     func.sum(df.group_counter).over(window_sum))
> THIS WORKS WELL:
> df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date',
>     'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version',
>     func.sum(df.group_counter).over(window_sum))
> Also, can anybody confirm that this is a true workaround?
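
To help narrow down whether the window output is actually wrong, it may be useful to compare it against a small reference computed outside Spark. The following is a minimal plain-Python sketch, not Spark code: the helper names `window_rank` and `window_running_sum` and the sample rows are hypothetical, and the running sum assumes the frame in the report means "from the first row of the partition up to the current row". Rows that tie on every ordering column can legitimately be accumulated in either order, so a cross-check like this is only conclusive when the ordering columns are unique within a partition.

```python
from itertools import groupby


def _sorted_partitions(rows, partition_key, order_keys):
    """Yield lists of row indices, one list per partition, ordered by all order_keys."""
    def part_of(i):
        return rows[i][partition_key]

    def order_of(i):
        return tuple(rows[i][k] for k in order_keys)

    indices = sorted(range(len(rows)), key=lambda i: (part_of(i), order_of(i)))
    for _, group in groupby(indices, key=part_of):
        yield [(i, order_of(i)) for i in group]


def window_rank(rows, partition_key, order_keys):
    """Reference for rank() over partitionBy(partition_key).orderBy(*order_keys)."""
    ranks = {}
    for partition in _sorted_partitions(rows, partition_key, order_keys):
        previous, current_rank = None, 0
        for position, (i, key) in enumerate(partition, start=1):
            if key != previous:
                # A new ordering key takes its 1-based position; ties keep the earlier rank.
                current_rank, previous = position, key
            ranks[i] = current_rank
    return ranks


def window_running_sum(rows, partition_key, order_keys, value_key):
    """Reference for a sum over rows from the partition start to the current row."""
    sums = {}
    for partition in _sorted_partitions(rows, partition_key, order_keys):
        total = 0
        for i, _ in partition:
            total += rows[i][value_key]
            sums[i] = total
    return sums


# Hypothetical sample using the column names from the report.
sample_rows = [
    {"user_unique_id": "u1", "creation_date": "2015-01-02", "mib_id": 1, "day": 1, "group_counter": 1},
    {"user_unique_id": "u1", "creation_date": "2015-01-01", "mib_id": 2, "day": 1, "group_counter": 1},
    {"user_unique_id": "u1", "creation_date": "2015-01-01", "mib_id": 1, "day": 1, "group_counter": 1},
    {"user_unique_id": "u2", "creation_date": "2015-01-01", "mib_id": 1, "day": 1, "group_counter": 1},
]
order_columns = ["creation_date", "mib_id", "day"]
expected_version = window_running_sum(sample_rows, "user_unique_id", order_columns, "group_counter")
expected_rank = window_rank(sample_rows, "user_unique_id", order_columns)
```

Both results are keyed by the index of the row in the input list. In this sample the second and third rows share a `creation_date` and differ only in `mib_id`, so they are the pair whose values would change if only the first ordering column were honoured. Collecting a few rows from the DataFrame and comparing `user_version` against `expected_version` would show whether the later columns are being applied.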