[
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-12835:
---------------------------------
Labels: bulk-closed (was: )
> StackOverflowError when aggregating over column from window function
> --------------------------------------------------------------------
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.0
> Reporter: Kalle Jepsen
> Priority: Major
> Labels: bulk-closed
>
> I am encountering a StackOverflowError with a very long traceback when I try
> to aggregate directly on a column created by a window function.
> For example, I am trying to determine the average timespan between dates in a
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
>     [datetime(2014,1,1)],
>     [datetime(2014,2,1)],
>     [datetime(2014,3,1)],
>     [datetime(2014,3,6)],
>     [datetime(2014,8,23)],
>     [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
>     ts,
>     functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> calling
> {code}
> df.select(avg_diff).show()
> {code}
> throws a {{java.lang.StackOverflowError}}.
> When I split the computation into two steps instead,
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> there is no error.
> Am I wrong to assume that the above should work?
> I have already described the same problem in [this question on
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].
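
For reference, the value that the failing aggregation would be expected to return can be checked without Spark. The following is only a minimal plain-Python sketch, not part of the original report: the helper names `day_diffs` and `avg_ignoring_none` are hypothetical, and it assumes that the average skips the leading null, as Spark's `avg` does for null inputs.

```python
from datetime import datetime

# Same sample dates as in the report above.
timestamps = [
    datetime(2014, 1, 1),
    datetime(2014, 2, 1),
    datetime(2014, 3, 1),
    datetime(2014, 3, 6),
    datetime(2014, 8, 23),
    datetime(2014, 10, 1),
]


def day_diffs(values):
    """Hypothetical helper: day difference to the previous value in sorted
    order. The first entry has no predecessor, so it is None, mirroring
    the null produced by lag() on the first row."""
    ordered = sorted(values)
    return [None] + [
        (cur - prev).days for prev, cur in zip(ordered, ordered[1:])
    ]


def avg_ignoring_none(values):
    """Hypothetical helper: mean of the non-None entries (assumption:
    nulls are skipped, as Spark's avg does)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


diffs = day_diffs(timestamps)
expected_avg = avg_ignoring_none(diffs)
print(diffs)
print(expected_avg)
```

The differences match the `diff` column shown in the report, so a result close to the printed average is what either the one-step or the two-step aggregation should produce.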