spark git commit: [SPARK-25328][PYTHON] Add an example for having two columns as the grouping key in group aggregate pandas UDF

2018-09-06 Thread cutlerb

Repository: spark
Updated Branches:
  refs/heads/branch-2.4 085f731ad -> f2d502223


[SPARK-25328][PYTHON] Add an example for having two columns as the grouping key 
in group aggregate pandas UDF

## What changes were proposed in this pull request?

This PR adds another example, this time with multiple grouping keys, to the group 
aggregate pandas UDF documentation, since this feature can still confuse users.
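
As a rough illustration of the call convention the new example documents, here is a pandas-only sketch. It does not run Spark: `apply_grouped` is a hypothetical helper standing in for `df.groupby(...).apply(udf)`, and the data mirrors the rows used in the docstring example.

```python
import numpy as np
import pandas as pd

# Same rows as the Spark DataFrame in the docstring example.
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})
# Materialize the second grouping key as a column (assumption: plain pandas
# has no equivalent of grouping by a Spark column expression).
pdf["ceil(v / 2)"] = np.ceil(pdf["v"] / 2).astype("int64")


def sum_udf(key, group):
    # key is a tuple with one value per grouping column; appending the
    # aggregate to it yields a single output row.
    return pd.DataFrame([key + (group.v.sum(),)])


def apply_grouped(frame, keys, func):
    # Hypothetical stand-in for Spark's grouped map: call func once per
    # group with (key tuple, group data) and concatenate the results.
    parts = [func(tuple(key), group) for key, group in frame.groupby(keys)]
    return pd.concat(parts, ignore_index=True)


result = apply_grouped(pdf, ["id", "ceil(v / 2)"], sum_udf)
result.columns = ["id", "ceil(v / 2)", "v"]
print(result)
```

The four resulting rows carry the same values as the table in the added docstring example; in Spark the row order is not guaranteed.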

## How was this patch tested?

Manually tested and documentation built.

Closes #22329 from HyukjinKwon/SPARK-25328.

Authored-by: hyukjinkwon 
Signed-off-by: Bryan Cutler 
(cherry picked from commit 7ef6d1daf858cc9a2c390074f92aaf56c219518a)
Signed-off-by: Bryan Cutler 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2d50222
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2d50222
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2d50222

Branch: refs/heads/branch-2.4
Commit: f2d5022233b637eb50567f7945042b3a8c9c6b25
Parents: 085f731
Author: hyukjinkwon 
Authored: Thu Sep 6 08:18:49 2018 -0700
Committer: Bryan Cutler 
Committed: Thu Sep 6 09:59:19 2018 -0700

--
 python/pyspark/sql/functions.py | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f2d50222/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 864780e..9396b16 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2783,14 +2783,14 @@ def pandas_udf(f=None, returnType=None, functionType=None):
    +---+---+
 
    Alternatively, the user can define a function that takes two arguments.
-   In this case, the grouping key will be passed as the first argument and the data will
-   be passed as the second argument. The grouping key will be passed as a tuple of numpy
+   In this case, the grouping key(s) will be passed as the first argument and the data will
+   be passed as the second argument. The grouping key(s) will be passed as a tuple of numpy
    data types, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in
    as a `pandas.DataFrame` containing all columns from the original Spark DataFrame.
-   This is useful when the user does not want to hardcode grouping key in the function.
+   This is useful when the user does not want to hardcode grouping key(s) in the function.
 
-   >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    >>> import pandas as pd  # doctest: +SKIP
+   >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    >>> df = spark.createDataFrame(
    ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ...     ("id", "v"))  # doctest: +SKIP
@@ -2806,6 +2806,22 @@ def pandas_udf(f=None, returnType=None, functionType=None):
    |  1|1.5|
    |  2|6.0|
    +---+---+
+   >>> @pandas_udf(
+   ...     "id long, `ceil(v / 2)` long, v double",
+   ...     PandasUDFType.GROUPED_MAP)  # doctest: +SKIP
+   >>> def sum_udf(key, pdf):
+   ...     # key is a tuple of two numpy.int64s, which is the values
+   ...     # of 'id' and 'ceil(df.v / 2)' for the current group
+   ...     return pd.DataFrame([key + (pdf.v.sum(),)])
+   >>> df.groupby(df.id, ceil(df.v / 2)).apply(sum_udf).show()  # doctest: +SKIP
+   +---+-----------+----+
+   | id|ceil(v / 2)|   v|
+   +---+-----------+----+
+   |  2|          5|10.0|
+   |  1|          1| 3.0|
+   |  2|          3| 5.0|
+   |  2|          2| 3.0|
+   +---+-----------+----+
 
    .. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
    recommended to explicitly index the columns by name to ensure the positions are correct,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-25328][PYTHON] Add an example for having two columns as the grouping key in group aggregate pandas UDF

2018-09-06 Thread cutlerb

Repository: spark
Updated Branches:
  refs/heads/master f5817d8bb -> 7ef6d1daf


[SPARK-25328][PYTHON] Add an example for having two columns as the grouping key 
in group aggregate pandas UDF

## What changes were proposed in this pull request?

This PR adds another example, this time with multiple grouping keys, to the group 
aggregate pandas UDF documentation, since this feature can still confuse users.

## How was this patch tested?

Manually tested and documentation built.

Closes #22329 from HyukjinKwon/SPARK-25328.

Authored-by: hyukjinkwon 
Signed-off-by: Bryan Cutler 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ef6d1da
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ef6d1da
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ef6d1da

Branch: refs/heads/master
Commit: 7ef6d1daf858cc9a2c390074f92aaf56c219518a
Parents: f5817d8
Author: hyukjinkwon 
Authored: Thu Sep 6 08:18:49 2018 -0700
Committer: Bryan Cutler 
Committed: Thu Sep 6 08:18:49 2018 -0700

--
 python/pyspark/sql/functions.py | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ef6d1da/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 864780e..9396b16 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2783,14 +2783,14 @@ def pandas_udf(f=None, returnType=None, functionType=None):
    +---+---+
 
    Alternatively, the user can define a function that takes two arguments.
-   In this case, the grouping key will be passed as the first argument and the data will
-   be passed as the second argument. The grouping key will be passed as a tuple of numpy
+   In this case, the grouping key(s) will be passed as the first argument and the data will
+   be passed as the second argument. The grouping key(s) will be passed as a tuple of numpy
    data types, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in
    as a `pandas.DataFrame` containing all columns from the original Spark DataFrame.
-   This is useful when the user does not want to hardcode grouping key in the function.
+   This is useful when the user does not want to hardcode grouping key(s) in the function.
 
-   >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    >>> import pandas as pd  # doctest: +SKIP
+   >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
    >>> df = spark.createDataFrame(
    ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ...     ("id", "v"))  # doctest: +SKIP
@@ -2806,6 +2806,22 @@ def pandas_udf(f=None, returnType=None, functionType=None):
    |  1|1.5|
    |  2|6.0|
    +---+---+
+   >>> @pandas_udf(
+   ...     "id long, `ceil(v / 2)` long, v double",
+   ...     PandasUDFType.GROUPED_MAP)  # doctest: +SKIP
+   >>> def sum_udf(key, pdf):
+   ...     # key is a tuple of two numpy.int64s, which is the values
+   ...     # of 'id' and 'ceil(df.v / 2)' for the current group
+   ...     return pd.DataFrame([key + (pdf.v.sum(),)])
+   >>> df.groupby(df.id, ceil(df.v / 2)).apply(sum_udf).show()  # doctest: +SKIP
+   +---+-----------+----+
+   | id|ceil(v / 2)|   v|
+   +---+-----------+----+
+   |  2|          5|10.0|
+   |  1|          1| 3.0|
+   |  2|          3| 5.0|
+   |  2|          2| 3.0|
+   +---+-----------+----+
 
    .. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
    recommended to explicitly index the columns by name to ensure the positions are correct,
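
The note at the end of the hunk above is cut off by the diff context. As a small pandas-only sketch of the practice it recommends, the snippet below pins the column positions when building a `pandas.DataFrame` from a dictionary. The variable names `ids` and `means` are illustrative only and are not taken from the patch.

```python
import pandas as pd

# Illustrative values only (not from the patch).
ids = [1, 2]
means = [1.5, 6.0]

# Passing columns= fixes the column positions, regardless of the order in
# which the dictionary keys are written.
out = pd.DataFrame({"v": means, "id": ids}, columns=["id", "v"])
print(list(out.columns))
```

This matters wherever the returned frame is matched against a declared schema such as "id long, v double" by position, which is the situation the note appears to guard against.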


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
