HyukjinKwon commented on a change in pull request #23946: [SPARK-26860][PySpark][SparkR] Fix for RangeBetween and RowsBetween docs to be in sync with Spark documentation
URL: https://github.com/apache/spark/pull/23946#discussion_r263240174
 
 

 ##########
 File path: python/pyspark/sql/window.py
 ##########
 @@ -97,6 +97,33 @@ def rowsBetween(start, end):
         and ``Window.currentRow`` to specify special boundary values, rather than using integral
         values directly.
 
 +        A row-based boundary is based on the position of the row within the partition.
 +        An offset indicates the number of rows above or below the current row at which
 +        the frame for the current row starts or ends. For instance, given a row-based
 +        sliding frame with a lower bound offset of -1 and an upper bound offset of +2,
 +        the frame for the row with index 5 would range from index 4 to index 6.
+
 +        >>> from pyspark import SparkContext
 +        >>> from pyspark.sql import Window
 +        >>> from pyspark.sql import functions as func
 +        >>> from pyspark.sql import SQLContext
 +        >>> sc = SparkContext.getOrCreate()
 +        >>> sqlContext = SQLContext(sc)
 +        >>> tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
 +        >>> df = sqlContext.createDataFrame(tup, ["id", "category"])
 +        >>> window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
 +        >>> df.withColumn("sum", func.sum("id").over(window)).show()
+        +---+--------+---+
+        | id|category|sum|
+        +---+--------+---+
+        |  1|       b|  3|
+        |  2|       b|  5|
+        |  3|       b|  3|
+        |  1|       a|  2|
+        |  1|       a|  3|
+        |  2|       a|  2|
+        +---+--------+---+
+        <BLANKLINE>
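As an aside, a minimal standalone sketch of the -1/+2 frame the new docstring paragraph describes (this uses `SparkSession` rather than the doctest's `SQLContext`; the master, app name, and variable names here are illustrative, not part of the patch):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as func

spark = SparkSession.builder.master("local[2]").appName("rowsBetweenDemo").getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")],
    ["id", "category"])

# Frame spans from one row before to two rows after the current row,
# matching the lower bound -1 / upper bound +2 example in the docstring.
w = Window.partitionBy("category").orderBy("id").rowsBetween(-1, 2)
df.withColumn("sum", func.sum("id").over(w)).show()
# Within category "b" (ids 1, 2, 3) the sums come out 6, 6, 5: for id=2,
# for example, the frame covers all three rows of the partition.
spark.stop()
```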
 
 Review comment:
  You can change the doctest-running code from:
   
   ```python
   import doctest
   SparkContext('local[4]', 'PythonTest')
   (failure_count, test_count) = doctest.testmod()
   ```
   
   to:
   
   ```python
   import doctest
   import pyspark.sql.window
   
   SparkContext('local[4]', 'PythonTest')
   globs = pyspark.sql.window.__dict__.copy()
   (failure_count, test_count) = doctest.testmod(
       pyspark.sql.window, globs=globs,
    optionflags=doctest.NORMALIZE_WHITESPACE)
   ```
   
   so that:
   
   1. the doctest doesn't need the trailing `<BLANKLINE>` marker
   2. when the tests are skipped, the report shows the correct fully qualified module names like `pyspark.sql.window...`, rather than `__main__. ...`.
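
   For reference, a sketch of how the whole `_test()` hook could look with that change folded in (the `sc.stop()` call and `sys.exit` handling follow the convention of other pyspark.sql modules and are assumptions, not part of this suggestion):

   ```python
   def _test():
       import doctest
       import sys
       import pyspark.sql.window
       from pyspark import SparkContext

       # Start a local SparkContext up front so the doctest's
       # SparkContext.getOrCreate() call reuses it.
       sc = SparkContext('local[4]', 'PythonTest')
       # Run the doctests against the module object itself, so skipped or
       # failing tests report as pyspark.sql.window.<name> rather than
       # __main__.<name>.
       globs = pyspark.sql.window.__dict__.copy()
       (failure_count, test_count) = doctest.testmod(
           pyspark.sql.window, globs=globs,
           # NORMALIZE_WHITESPACE lets the expected output omit the
           # trailing <BLANKLINE> after show().
           optionflags=doctest.NORMALIZE_WHITESPACE)
       sc.stop()
       if failure_count:
           sys.exit(-1)


   if __name__ == "__main__":
       _test()
   ```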
