[
https://issues.apache.org/jira/browse/SPARK-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai resolved SPARK-11452.
------------------------------
Resolution: Duplicate
Assignee: Davies Liu
Fix Version/s: 1.5.2
> Window functions give invalid values
> ------------------------------------
>
> Key: SPARK-11452
> URL: https://issues.apache.org/jira/browse/SPARK-11452
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Environment: yarn-client on Amazon EMR
> Reporter: Jarno Seppanen
> Assignee: Davies Liu
> Fix For: 1.5.2
>
>
> I'm getting incorrect results from several window functions:
> * negative values from ROW_NUMBER()
> * incorrect floats from SUM(...) OVER (...)
> * huge integers from COUNT(...) OVER (...)
> Here's an example pyspark session that shows the incorrect behaviors:
> {code}
> >>> logout_acc.printSchema()
> root
> |-- account: long (nullable = true)
> |-- avatar: long (nullable = true)
> |-- timestamp: timestamp (nullable = true)
> |-- session_dur: double (nullable = true)
> >>> session_numbers = sqlContext.sql("""
> ... SELECT account, avatar, timestamp,
> ... COUNT(*) OVER (
> ... PARTITION BY account ORDER BY timestamp) AS session,
> ... ROW_NUMBER() OVER (
> ... PARTITION BY account ORDER BY timestamp) AS row_num,
> ... SUM(session_dur) OVER (
> ... PARTITION BY account ORDER BY timestamp) AS playtime
> ... FROM logout_acc
> ... """)
> >>> session_numbers.cache().registerTempTable('session_numbers')
> >>> session_numbers.show()
> +-------+------+--------------------+-------+-----------+--------------+
> |account|avatar| timestamp|session| row_num| playtime|
> +-------+------+--------------------+-------+-----------+--------------+
> | ...1| ...1|2015-09-01 00:04:...| 1| 1261342896| 4.991264E-317|
> | ...2| ...2|2015-09-01 00:40:...| 1| -900190400| 4.991363E-317|
> | ...2| ...2|2015-09-01 00:43:...| 2| -695150400| 4.991363E-317|
> | ...2| ...2|2015-09-01 00:50:...| 3| -278701400| 4.991363E-317|
> | ...2| ...2|2015-09-01 01:25:...| 4| 1824111600| 4.991363E-317|
> | ...2| ...2|2015-09-01 01:28:...| 5| 1992217600| 4.991363E-317|
> | ...2| ...2|2015-09-01 01:31:...| 6|-2127806696| 4.991363E-317|
> | ...2| ...2|2015-09-03 00:47:...| 7| 574647760| 4.991363E-317|
> | ...3| ...3|2015-09-01 01:16:...| 1| 1269805600|4.9914617E-317|
> | ...3| ...3|2015-09-01 01:22:...| 2| 1621270600|4.9914617E-317|
> | ...3| ...3|2015-09-01 14:52:...| 3|-1269231952|4.9914617E-317|
> | ...3| ...3|2015-09-01 14:59:...| 4| -865949952|4.9914617E-317|
> | ...3| ...3|2015-09-04 05:05:...| 5| -678665344|4.9914617E-317|
> | ...3| ...3|2015-09-05 16:19:...| 6| 1645213072|4.9914617E-317|
> | ...3| ...3|2015-09-05 17:09:...| 7| 339651776|4.9914617E-317|
> | ...3| ...3|2015-09-05 20:27:...| 8| -669300112|4.9914617E-317|
> | ...3| ...3|2015-09-05 21:12:...| 9| 2032531888|4.9914617E-317|
> | ...3| ...3|2015-09-09 02:45:...| 10| 2046885648|4.9914617E-317|
> | ...3| ...3|2015-09-09 03:47:...| 11| 1493942352|4.9914617E-317|
> | ...3| ...3|2015-09-09 21:09:...| 12| -434263088|4.9914617E-317|
> +-------+------+--------------------+-------+-----------+--------------+
> >>> sqlContext.sql('select max(session) from session_numbers').show()
> +-------------------+
>
> | _c0|
> +-------------------+
> |4687099421149855744|
> +-------------------+
> {code}
> In fact, this was a transient problem, because after I re-ran the above
> commands, the results came out correct. Unfortunately I also cannot share the
> data to try and reproduce the problem, but I hope this report might be useful
> nevertheless. Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]