[ https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-7965:
------------------------------
    Labels: correctness  (was: )

> Wrong answers for queries with multiple window specs in the same expression
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-7965
>                 URL: https://issues.apache.org/jira/browse/SPARK-7965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Yin Huai
>              Labels: correctness
>             Fix For: 1.4.0
>
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
>
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
>     select sum(x) over (rows between unbounded preceding and current row)
>     from nums
> """).collect()
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15],
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
>     select sum(x) over (rows between unbounded preceding and unbounded following)
>     from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>     select (sum(x) over (rows between unbounded preceding and current row))
>     /
>     (sum(x) over (rows between unbounded preceding and unbounded following))
>     from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, the global total,
> and the combined expression in the same query, we see that the first two
> values are computed correctly, but the combined expression is still
> incorrect:
> {code}
> sqlContext.sql("""
>     select
>     sum(x) over (rows between unbounded preceding and current row) as running_sum,
>     (sum(x) over (rows between unbounded preceding and unbounded following)) as total_sum,
>     ((sum(x) over (rows between unbounded preceding and current row))
>     /
>     (sum(x) over (rows between unbounded preceding and unbounded following))) as combined
>     from nums
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
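> For reference, here is a plain-Scala sketch (no Spark involved, names are
> illustrative) of the values the combined expression should produce for
> x = 1..10: each row's running sum divided by the grand total, so only the
> final row should equal 1.0:
> {code}
> // Expected per-row CDF values, computed without Spark for comparison.
> val xs = (1 to 10).toList
> val total = xs.sum                          // 55
> val running = xs.scanLeft(0)(_ + _).tail    // 1, 3, 6, ..., 55
> val cdf = running.map(_.toDouble / total)   // ~0.018, ~0.073, ..., 1.0
> println(cdf)
> {code}
> In other words, a correct result would be monotonically increasing from
> 1/55 up to 1.0, not a constant 1.0 in every row.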
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
