zhengruifeng opened a new pull request, #41674:
URL: https://github.com/apache/spark/pull/41674

   ### What changes were proposed in this pull request?
   Add `__repr__` for `GroupedData`
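   
   For illustration, here is a minimal sketch of one way such a `__repr__` could work on vanilla PySpark (an assumption for illustration, not necessarily the actual patch): delegate to the underlying JVM `RelationalGroupedDataset`, whose `toString` already renders the grouping expressions, value schema, and group type, and swap the class-name prefix. The `_jgd` attribute access and the prefix rewrite below are part of the sketch.
   ```
   # Hypothetical sketch only; the merged implementation may differ.
   def _grouped_data_repr(gd) -> str:
       # gd._jgd is the py4j handle to the JVM RelationalGroupedDataset; its
       # toString renders "RelationalGroupedDataset: [grouping expressions: ...,
       # value: ..., type: ...]".
       jvm_repr = gd._jgd.toString()
       # Swap the Scala class name for the Python one so the result reads
       # "GroupedData[...]", as in the examples below.
       return jvm_repr.replace("RelationalGroupedDataset: ", "GroupedData", 1)
   ```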
   
   
   ### Why are the changes needed?
   `GroupedData.__repr__` is missing, so printing a `GroupedData` falls back to the default object repr (class name and memory address).
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   1. On Scala side:
   ```
   scala> val df = Seq(("414243", "4243")).toDF("e", "f")
   df: org.apache.spark.sql.DataFrame = [e: string, f: string]
   
   scala> df.groupBy("e")
   res0: org.apache.spark.sql.RelationalGroupedDataset = RelationalGroupedDataset: [grouping expressions: [e: string], value: [e: string, f: string], type: GroupBy]
   
   scala> df.groupBy(df.col("e"))
   res1: org.apache.spark.sql.RelationalGroupedDataset = RelationalGroupedDataset: [grouping expressions: [e: string], value: [e: string, f: string], type: GroupBy]
   ```
   
   2. On vanilla PySpark:
   
   before this PR:
   ```
   In [1]: df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
   
   In [2]: df
   Out[2]: DataFrame[e: string, f: string]
   
   In [3]: df.groupBy("e")
   Out[3]: <pyspark.sql.group.GroupedData at 0x10423a4c0>
   
   In [4]: df.groupBy(df.e)
   Out[4]: <pyspark.sql.group.GroupedData at 0x1041dd640>
   
   ```
   
   after this PR:
   ```
   In [1]: df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
   
   In [2]: df
   Out[2]: DataFrame[e: string, f: string]
   
   In [3]: df.groupBy("e")
   Out[3]: GroupedData[grouping expressions: [e], value: [e: string, f: string], type: GroupBy]
   
   In [4]: df.groupBy(df.e)
   Out[4]: GroupedData[grouping expressions: [e: string], value: [e: string, f: string], type: GroupBy]
   ```
   
   3. On Spark Connect Python Client:
   
   before this PR:
   ```
   In [1]: df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
   
   In [2]: df
   Out[2]: DataFrame[e: string, f: string]
   
   In [3]: df.groupBy("e")
   Out[3]: <pyspark.sql.connect.group.GroupedData at 0x1046157c0>
   
   In [4]: df.groupBy(df.e)
   Out[4]: <pyspark.sql.connect.group.GroupedData at 0x11da5ceb0>
   ```
   
   after this PR:
   ```
   In [1]: df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
   
   In [2]: df
   Out[2]: DataFrame[e: string, f: string]
   
   In [3]: df.groupBy("e")
   Out[3]: GroupedData[grouping expressions: [e], value: [e: string, f: string], type: GroupBy]
   
   In [4]: df.groupBy(df.e)
   Out[4]: GroupedData[grouping expressions: [e], value: [e: string, f: string], type: GroupBy]
   ```
   
   Note that since the expressions in the Spark Connect Python Client are not resolved, the repr string can differ from vanilla PySpark's.
   
   
   ### How was this patch tested?
   Added doctests.
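   
   A doctest of roughly this shape would exercise the new repr (illustrative only; the exact examples added in the patch may differ):
   ```
   >>> df = spark.createDataFrame([("414243", "4243",)], ["e", "f"])
   >>> df.groupBy("e")
   GroupedData[grouping expressions: [e], value: [e: string, f: string], type: GroupBy]
   ```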
   

