Kalle Jepsen created SPARK-12835:
------------------------------------
Summary: StackOverflowError when aggregating over column from window function
Key: SPARK-12835
URL: https://issues.apache.org/jira/browse/SPARK-12835
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.6.0
Reporter: Kalle Jepsen
I am encountering a {{StackOverflowError}} with a very long traceback when I try
to aggregate directly on a column created by a window function.
E.g. I am trying to determine the average timespan between dates in a DataFrame
column by using a window function:
{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions
from datetime import datetime

sc = SparkContext()
sq = HiveContext(sc)

data = [
    [datetime(2014, 1, 1)],
    [datetime(2014, 2, 1)],
    [datetime(2014, 3, 1)],
    [datetime(2014, 3, 6)],
    [datetime(2014, 8, 23)],
    [datetime(2014, 10, 1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')

# Difference in days between each date and its predecessor in the window.
w = Window.orderBy(ts)
diff = functions.datediff(
    ts,
    functions.lag(ts, count=1).over(w)
)
avg_diff = functions.avg(diff)
{code}
While {{df.select(diff.alias('diff')).show()}} correctly renders as
{noformat}
+----+
|diff|
+----+
|null|
| 31|
| 28|
| 5|
| 170|
| 39|
+----+
{noformat}
doing
{code}
df.select(avg_diff).show()
{code}
throws a {{java.lang.StackOverflowError}}.
When I instead write
{code}
df2 = df.select(diff.alias('diff'))
df2.select(functions.avg('diff')).show()
{code}
however, there is no error.
Am I wrong to assume that the above should work?
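For reference, the value the aggregation should produce can be checked with plain Python, independently of Spark. This is just a sketch of the intended semantics, using the same six dates as in the DataFrame above: {{lag}} pairs each date with its predecessor, {{datediff}} yields the day gaps, and {{avg}} skips the leading null.

```python
from datetime import datetime

# The same dates as in the DataFrame above.
dates = [
    datetime(2014, 1, 1),
    datetime(2014, 2, 1),
    datetime(2014, 3, 1),
    datetime(2014, 3, 6),
    datetime(2014, 8, 23),
    datetime(2014, 10, 1),
]

# Day gaps between consecutive dates, mirroring datediff(ts, lag(ts, 1)).
diffs = [(b - a).days for a, b in zip(dates, dates[1:])]
print(diffs)  # [31, 28, 5, 170, 39]

# avg() ignores the null produced for the first row, so the mean
# is taken over the five non-null gaps.
avg = sum(diffs) / len(diffs)
print(avg)  # 54.6
```

So the expected output of {{df.select(avg_diff).show()}} would be a single-row result of 54.6, rather than an error.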
I've already described the same in [this question on
stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)