Kalle Jepsen created SPARK-12835:
------------------------------------

             Summary: StackOverflowError when aggregating over column from window function
                 Key: SPARK-12835
                 URL: https://issues.apache.org/jira/browse/SPARK-12835
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.0
            Reporter: Kalle Jepsen


I am encountering a StackOverflowError with a very long traceback when I try to aggregate directly on a column created by a window function.

For example, I am trying to determine the average time span between dates in a DataFrame column by using a window function:

{code}
from datetime import datetime

from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions

sc = SparkContext()
sq = HiveContext(sc)

data = [
    [datetime(2014, 1, 1)],
    [datetime(2014, 2, 1)],
    [datetime(2014, 3, 1)],
    [datetime(2014, 3, 6)],
    [datetime(2014, 8, 23)],
    [datetime(2014, 10, 1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')

# Days between each timestamp and the previous one,
# computed with lag() over a window ordered by ts
w = Window.orderBy(ts)
diff = functions.datediff(
    ts,
    functions.lag(ts, count=1).over(w)
)

# Aggregate directly over the window-function column
avg_diff = functions.avg(diff)

{code}

While {{df.select(diff.alias('diff')).show()}} correctly renders as

{noformat}
    +----+
    |diff|
    +----+
    |null|
    |  31|
    |  28|
    |   5|
    | 170|
    |  39|
    +----+
{noformat}

doing

{code}
df.select(avg_diff).show()
{code}

throws a {{java.lang.StackOverflowError}}.

When I instead write

{code}
# Materialize the window-function column first, then aggregate over it
df2 = df.select(diff.alias('diff'))
df2.select(functions.avg('diff')).show()
{code}

however, there's no error.
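
For comparison, here is a minimal sketch of the same two-step workaround written with {{DataFrame.agg}} (assuming the {{df}} and {{diff}} defined above). Since {{avg}} ignores the null that {{lag}} produces for the first row, I would expect it to print 54.6, the mean of the diffs 31, 28, 5, 170 and 39 shown above:

{code}
# Sketch: materialize the window-function column, then aggregate with agg().
# avg() skips the null in the first row, so the expected result is
# (31 + 28 + 5 + 170 + 39) / 5 = 54.6
df.select(diff.alias('diff')).agg(functions.avg('diff')).show()
{code}

I would expect this to behave like the {{df2}} workaround, since it likewise aggregates over an already materialized column rather than directly over the window expression.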

Am I wrong to assume that the above should work?

I've already described the same issue in [this question on stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].


