Thanks all for the explanation. I am happy to update the API doc. https://issues.apache.org/jira/browse/SPARK-23861
On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote:

> Ah ok. Thanks for commenting. Every day I learn something new about SQL.
>
> For others to follow, SQL Server has a good explanation of the behavior:
> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql
>
> Can somebody (Li?) update the API documentation to specify the gotchas, in
> case users are not familiar with SQL window function semantics?
>
> General Remarks
>
> More than one window function can be used in a single query with a single
> FROM clause. The OVER clause for each function can differ in partitioning
> and ordering.
>
> If PARTITION BY is not specified, the function treats all rows of the
> query result set as a single group.
>
> Important
>
> If ROWS/RANGE is specified and <window frame preceding> is used for
> <window frame extent> (short syntax), then this specification is used for
> the window frame boundary starting point and CURRENT ROW is used for the
> boundary ending point. For example, "ROWS 5 PRECEDING" is equal to "ROWS
> BETWEEN 5 PRECEDING AND CURRENT ROW".
>
> Note
>
> If ORDER BY is not specified, the entire partition is used for the window
> frame. This applies only to functions that do not require an ORDER BY
> clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window
> frame. This applies only to functions that can accept an optional
> ROWS/RANGE specification. For example, ranking functions cannot accept
> ROWS/RANGE, therefore this window frame is not applied even though ORDER
> BY is present and ROWS/RANGE is not.
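[Editor's note: as a quick illustration of the short ROWS syntax quoted above, here is a minimal plain-Python sketch, not Spark code; the function name is invented for illustration. "ROWS 5 PRECEDING" means a frame of the five preceding rows plus the current row, i.e. "ROWS BETWEEN 5 PRECEDING AND CURRENT ROW".]

```python
def rows_preceding_mean(values, n):
    """Mean over a frame of the n preceding rows plus the current row,
    i.e. ROWS BETWEEN n PRECEDING AND CURRENT ROW (short syntax: ROWS n
    PRECEDING). Plain-Python model only, not a Spark API."""
    out = []
    for i in range(len(values)):
        # Frame start is clamped at the partition boundary.
        frame = values[max(0, i - n) : i + 1]
        out.append(sum(frame) / len(frame))
    return out

# With n=1 the frame at each row is: [10], [10,20], [20,30], [30,40]
print(rows_preceding_mean([10, 20, 30, 40], 1))  # [10.0, 15.0, 25.0, 35.0]
```

When n exceeds the number of preceding rows, the frame simply grows from the start of the partition, which is why a cumulative mean falls out of the same rule.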
> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>
>> This is actually by design: without an `ORDER BY` clause, all rows are
>> considered peers of the current row, which means that the frame is
>> effectively the entire partition. This behavior follows the window
>> syntax of PGSQL. You can refer to the comment by yhuai:
>> https://github.com/apache/spark/pull/5604#discussion_r157931911
>> :)
>>
>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>>
>>> Do other (non-Hive) SQL systems do the same thing?
>>>
>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier
>>> <her...@databricks.com> wrote:
>>>
>>>> This is something we inherited from Hive:
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>>
>>>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>>> CURRENT ROW.
>>>>
>>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>> UNBOUNDED FOLLOWING.
>>>>
>>>> It sort of makes sense if you think about it. If there is no ordering,
>>>> there is no way to have a bounded frame. If there is an ordering, we
>>>> default to the most commonly used deterministic frame.
>>>>
>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Seems like a bug.
>>>>>
>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>>
>>>>>> Hi Devs,
>>>>>>
>>>>>> I am seeing some behavior with window functions that is a bit
>>>>>> unintuitive and would like to get some clarification.
>>>>>>
>>>>>> When using an aggregation function with a window, the frame boundary
>>>>>> seems to change depending on the ordering of the window.
>>>>>> Example:
>>>>>>
>>>>>> (1)
>>>>>>
>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>> w1 = Window.partitionBy('id')
>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>>
>>>>>> +---+---+---+
>>>>>> | id|  v| v2|
>>>>>> +---+---+---+
>>>>>> |  0|  1|2.0|
>>>>>> |  0|  2|2.0|
>>>>>> |  0|  3|2.0|
>>>>>> +---+---+---+
>>>>>>
>>>>>> (2)
>>>>>>
>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>>
>>>>>> +---+---+---+
>>>>>> | id|  v| v2|
>>>>>> +---+---+---+
>>>>>> |  0|  1|1.0|
>>>>>> |  0|  2|1.5|
>>>>>> |  0|  3|2.0|
>>>>>> +---+---+---+
>>>>>>
>>>>>> It seems that orderBy('v') in example (2) also changes the frame
>>>>>> boundaries from (unboundedPreceding, unboundedFollowing) to
>>>>>> (unboundedPreceding, currentRow).
>>>>>>
>>>>>> I found this behavior a bit unintuitive. I wonder if this behavior
>>>>>> is by design and, if so, what is the specific rule for how orderBy()
>>>>>> interacts with frame boundaries?
>>>>>>
>>>>>> Thanks,
>>>>>> Li
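[Editor's note: the two default frames the thread settles on can be reproduced in plain Python, with no Spark required. The function names below are invented for illustration; only the frame semantics come from the thread. The ties case also shows where a RANGE frame differs from a cumulative ROWS frame, since tied ORDER BY values are peers and share one frame.]

```python
# Plain-Python model of the two default frames discussed in the thread:
#   no orderBy -> frame is the whole partition
#   orderBy    -> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
#                 i.e. all rows whose ORDER BY key is <= the current key.

def mean_whole_partition(v):
    """No ORDER BY: every row sees the entire partition."""
    m = sum(v) / len(v)
    return [m] * len(v)

def mean_range_to_current(v):
    """ORDER BY present: RANGE frame up to the current row's peers.
    v is assumed already sorted by the ORDER BY key; tied values are
    peers and therefore share the same frame."""
    out = []
    for cur in v:
        frame = [x for x in v if x <= cur]
        out.append(sum(frame) / len(frame))
    return out

v = [1, 2, 3]  # the single partition id=0 from the example
print(mean_whole_partition(v))   # [2.0, 2.0, 2.0] -- matches (1)
print(mean_range_to_current(v))  # [1.0, 1.5, 2.0] -- matches (2)

# With ties, RANGE differs from a cumulative ROWS frame: both 2s are
# peers and see the same frame [1, 2, 2].
print(mean_range_to_current([1, 2, 2, 3]))
```

To keep the whole-partition aggregate while still ordering the window, PySpark lets you state the frame explicitly, e.g. `w2.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.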