Thanks all for the explanation. I am happy to update the API doc. https://issues.apache.org/jira/browse/SPARK-23861
On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <r...@databricks.com> wrote:

> Ah ok. Thanks for commenting. Every day I learn something new about SQL.
>
> For others to follow, SQL Server has a good explanation of the behavior:
> https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql
>
> Can somebody (Li?) update the API documentation to specify the gotchas, in
> case users are not familiar with SQL window function semantics?
>
> General Remarks
>
> More than one window function can be used in a single query with a single
> FROM clause. The OVER clause for each function can differ in partitioning
> and ordering.
>
> If PARTITION BY is not specified, the function treats all rows of the
> query result set as a single group.
>
> Important
>
> If ROWS/RANGE is specified and <window frame preceding> is used for
> <window frame extent> (short syntax), then this specification is used for
> the window frame boundary starting point and CURRENT ROW is used for the
> boundary ending point. For example, "ROWS 5 PRECEDING" is equal to "ROWS
> BETWEEN 5 PRECEDING AND CURRENT ROW".
>
> Note
>
> If ORDER BY is not specified, the entire partition is used for the window
> frame. This applies only to functions that do not require an ORDER BY
> clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window
> frame. This applies only to functions that can accept an optional
> ROWS/RANGE specification. For example, ranking functions cannot accept
> ROWS/RANGE, therefore this window frame is not applied even though ORDER
> BY is present and ROWS/RANGE is not.
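[Editor's note: as a quick illustration of the short ROWS syntax quoted above, here is a minimal plain-Python sketch, not Spark code; the function name is invented for illustration. "ROWS 5 PRECEDING" means a frame of the five preceding rows plus the current row, i.e. "ROWS BETWEEN 5 PRECEDING AND CURRENT ROW".]

```python
def rows_preceding_mean(values, n):
    """Mean over a frame of the n preceding rows plus the current row,
    i.e. ROWS BETWEEN n PRECEDING AND CURRENT ROW (short syntax: ROWS n
    PRECEDING). Plain-Python model only, not a Spark API."""
    out = []
    for i in range(len(values)):
        # Frame start is clamped at the partition boundary.
        frame = values[max(0, i - n) : i + 1]
        out.append(sum(frame) / len(frame))
    return out

# With n=1 the frame at each row is: [10], [10,20], [20,30], [30,40]
print(rows_preceding_mean([10, 20, 30, 40], 1))  # [10.0, 15.0, 25.0, 35.0]
```

When n exceeds the number of preceding rows, the frame simply grows from the start of the partition, which is why a cumulative mean falls out of the same rule.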
> On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>
>> This is actually by design: without an `ORDER BY` clause, all rows are
>> considered peers of the current row, which means that the frame is
>> effectively the entire partition. This behavior follows the window
>> syntax of PGSQL. You can refer to the comment by yhuai:
>> https://github.com/apache/spark/pull/5604#discussion_r157931911
>> :)
>>
>> 2018-04-04 6:27 GMT+08:00 Reynold Xin <r...@databricks.com>:
>>
>>> Do other (non-Hive) SQL systems do the same thing?
>>>
>>> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier
>>> <her...@databricks.com> wrote:
>>>
>>>> This is something we inherited from Hive:
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>>>
>>>>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
>>>>> CURRENT ROW.
>>>>
>>>>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>> UNBOUNDED FOLLOWING.
>>>>
>>>> It sort of makes sense if you think about it. If there is no ordering,
>>>> there is no way to have a bounded frame. If there is an ordering, we
>>>> default to the most commonly used deterministic frame.
>>>>
>>>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Seems like a bug.
>>>>>
>>>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>>>>
>>>>>> Hi Devs,
>>>>>>
>>>>>> I am seeing some behavior with window functions that is a bit
>>>>>> unintuitive and would like to get some clarification.
>>>>>>
>>>>>> When using an aggregation function with a window, the frame boundary
>>>>>> seems to change depending on the ordering of the window.
>>>>>> Example:
>>>>>>
>>>>>> (1)
>>>>>>
>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>> w1 = Window.partitionBy('id')
>>>>>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>>>>>
>>>>>> +---+---+---+
>>>>>> | id|  v| v2|
>>>>>> +---+---+---+
>>>>>> |  0|  1|2.0|
>>>>>> |  0|  2|2.0|
>>>>>> |  0|  3|2.0|
>>>>>> +---+---+---+
>>>>>>
>>>>>> (2)
>>>>>>
>>>>>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>>>>> w2 = Window.partitionBy('id').orderBy('v')
>>>>>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>>>>>
>>>>>> +---+---+---+
>>>>>> | id|  v| v2|
>>>>>> +---+---+---+
>>>>>> |  0|  1|1.0|
>>>>>> |  0|  2|1.5|
>>>>>> |  0|  3|2.0|
>>>>>> +---+---+---+
>>>>>>
>>>>>> It seems that orderBy('v') in example (2) also changes the frame
>>>>>> boundaries from (unboundedPreceding, unboundedFollowing) to
>>>>>> (unboundedPreceding, currentRow).
>>>>>>
>>>>>> I found this behavior a bit unintuitive. I wonder if this behavior
>>>>>> is by design and, if so, what is the specific rule for how orderBy()
>>>>>> interacts with frame boundaries?
>>>>>>
>>>>>> Thanks,
>>>>>> Li
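[Editor's note: the two default frames the thread settles on can be reproduced in plain Python, with no Spark required. The function names below are invented for illustration; only the frame semantics come from the thread. The ties case also shows where a RANGE frame differs from a cumulative ROWS frame, since tied ORDER BY values are peers and share one frame.]

```python
# Plain-Python model of the two default frames discussed in the thread:
#   no orderBy -> frame is the whole partition
#   orderBy    -> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
#                 i.e. all rows whose ORDER BY key is <= the current key.

def mean_whole_partition(v):
    """No ORDER BY: every row sees the entire partition."""
    m = sum(v) / len(v)
    return [m] * len(v)

def mean_range_to_current(v):
    """ORDER BY present: RANGE frame up to the current row's peers.
    v is assumed already sorted by the ORDER BY key; tied values are
    peers and therefore share the same frame."""
    out = []
    for cur in v:
        frame = [x for x in v if x <= cur]
        out.append(sum(frame) / len(frame))
    return out

v = [1, 2, 3]  # the single partition id=0 from the example
print(mean_whole_partition(v))   # [2.0, 2.0, 2.0] -- matches (1)
print(mean_range_to_current(v))  # [1.0, 1.5, 2.0] -- matches (2)

# With ties, RANGE differs from a cumulative ROWS frame: both 2s are
# peers and see the same frame [1, 2, 2].
print(mean_range_to_current([1, 2, 2, 3]))
```

To keep the whole-partition aggregate while still ordering the window, PySpark lets you state the frame explicitly, e.g. `w2.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.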