zhengruifeng commented on PR #38841:
URL: https://github.com/apache/spark/pull/38841#issuecomment-1331687981

   > > 1, pyspark and scala api only accept string `*str` / `string*`
   > 
   > @zhengruifeng can you elaborate? I tested the same code in PySpark and it works as well.
   > 
   > > 2, pyspark and scala api will check the schema, if the datatype is unexpected, it fails;
   > 
   > What do you mean?
   > 
   > > 3, if there is no input column, it will check the schema and select all the numeric columns.
   > 
   > This is more missing functionality than this particular bug correct?
   
   1, the PySpark and Scala APIs don't take an expression or Column as input, only string column names:
   
   ```
   In [11]: df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"), (None, 10, "Tom"), (None, None, None)], schema=["age", "height", "name"])
   
   In [12]: df.show()
   +----+------+-----+
   | age|height| name|
   +----+------+-----+
   |  10|    80|Alice|
   |   5|  null|  Bob|
   |null|    10|  Tom|
   |null|  null| null|
   +----+------+-----+
   
   
   In [13]: df.groupBy("age").min(df.height)
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   Cell In[13], line 1
   ----> 1 df.groupBy("age").min(df.height)
   
   File ~/Dev/spark/python/pyspark/sql/group.py:49, in df_varargs_api.<locals>._api(self, *cols)
        47 def _api(self: "GroupedData", *cols: str) -> DataFrame:
        48     name = f.__name__
   ---> 49     jdf = getattr(self._jgd, name)(_to_seq(self.session._sc, cols))
        50     return DataFrame(jdf, self.session)
   
   ...
   
   TypeError: Column is not iterable
   
   ```
   
   2, if an input column has a non-numeric data type, it fails:
   
   ```
   In [14]: df.groupBy("age").min("name")
   ---------------------------------------------------------------------------
   AnalysisException                         Traceback (most recent call last)
   Cell In[14], line 1
   ----> 1 df.groupBy("age").min("name")
   
   File ~/Dev/spark/python/pyspark/sql/group.py:49, in df_varargs_api.<locals>._api(self, *cols)
        47 def _api(self: "GroupedData", *cols: str) -> DataFrame:
        48     name = f.__name__
   ---> 49     jdf = getattr(self._jgd, name)(_to_seq(self.session._sc, cols))
        50     return DataFrame(jdf, self.session)
   
   File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
      1316 command = proto.CALL_COMMAND_NAME +\
      1317     self.command_header +\
      1318     args_command +\
      1319     proto.END_COMMAND_PART
      1321 answer = self.gateway_client.send_command(command)
   -> 1322 return_value = get_return_value(
      1323     answer, self.gateway_client, self.target_id, self.name)
      1325 for temp_arg in temp_args:
      1326     if hasattr(temp_arg, "_detach"):
   
   File ~/Dev/spark/python/pyspark/sql/utils.py:205, in capture_sql_exception.<locals>.deco(*a, **kw)
       201 converted = convert_exception(e.java_exception)
       202 if not isinstance(converted, UnknownException):
       203     # Hide where the exception came from that shows a non-Pythonic
       204     # JVM exception message.
   --> 205     raise converted from None
       206 else:
       207     raise
   
   AnalysisException: "name" is not a numeric column. Aggregation function can only be applied on a numeric column.
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
