[GitHub] [spark] oeuf commented on a change in pull request #34812: [WIP][SPARK-37553][PYTHON] Fix underscore (`_`) bug in pyspark.pandas.frames.DataFrame.pivot_table

GitBox Sun, 05 Dec 2021 19:35:06 -0800


oeuf commented on a change in pull request #34812:
URL: https://github.com/apache/spark/pull/34812#discussion_r762681880




##########
File path: python/pyspark/pandas/frame.py
##########
@@ -6054,17 +6056,21 @@ def pivot_table(
                     # E.g. if column is b and values is ['b','e'],
                     # then ['2_b', '2_e', '3_b', '3_e'].
 
-                    # We sort the columns of Spark DataFrame by values.
-                    data_columns.sort(key=lambda x: x.split("_", 1)[1])

Review comment:
       Thank you for the feedback, I appreciate it! :)
   
   I tried using the `-1` index, but it doesn't give the sort order expected by the tests. The `Series.unique` call should only happen for a single column -- the docstring says only a single column is supported ([Link](https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L5861-L5863)). Agreed that it's expensive, but I am not sure what else to do.
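
   To make the sort-key difference concrete, here is a minimal plain-Python sketch, not the actual patch. The column names are hypothetical, the helper names `key_first_split` and `key_last_piece` are made up for illustration, and "the `-1` index" is read here as "take the piece after the last underscore", which is an assumption.

```python
# Hypothetical flattened names of the form "<pivot value>_<values column>".
names = ["2_b", "2_e", "3_b", "3_e"]


def key_first_split(name):
    # Key used by the removed line: everything after the FIRST underscore.
    return name.split("_", 1)[1]


def key_last_piece(name):
    # Assumed reading of "the -1 index": the piece after the LAST underscore.
    return name.split("_")[-1]


# With no extra underscores both keys agree, so both sorts agree.
sorted_first = sorted(names, key=key_first_split)
sorted_last = sorted(names, key=key_last_piece)

# An underscore inside the pivot value breaks the first-split key ...
pivot_with_underscore = "x_y_b"   # pivot value "x_y", values column "b"
# ... and an underscore inside the values column breaks the last-piece key.
values_with_underscore = "2_b_c"  # pivot value "2", values column "b_c"

print(sorted_first, sorted_last)
print(key_first_split(pivot_with_underscore), key_last_piece(pivot_with_underscore))
print(key_first_split(values_with_underscore), key_last_piece(values_with_underscore))
```

   Neither key can recover the values column from the name alone once underscores may appear on both sides, which is why the pivot values have to come from somewhere else.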
   
   What do you think about using `_columns = [str(i) for i in set(self[columns].tolist())]` instead of `_columns = [str(i) for i in self[columns].unique().tolist()]`? Would this be less expensive?
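
   For comparison, a small plain-Python stand-in for the two de-duplication styles. This is only a sketch: `values` is a hypothetical list, and `dict.fromkeys` is used as an order-preserving substitute for a `unique()` call, not as what pandas-on-Spark does internally. As far as I understand, calling `tolist()` on the whole column first would bring every row to the driver before `set()` removes duplicates, so the `set` variant may well be the more expensive one.

```python
# Hypothetical values of the pivot column, already collected into a list.
values = [3, 2, 3, 2, 2]

# Order-preserving de-duplication: keeps first-seen order.
first_seen = [str(v) for v in dict.fromkeys(values)]

# set() also de-duplicates, but its iteration order is not guaranteed.
from_set = [str(v) for v in set(values)]

# Both contain the same elements; sort explicitly if a fixed order is needed.
same_elements = sorted(first_seen) == sorted(from_set)
print(first_seen, same_elements)
```

   Either way, any code that relies on the order of `_columns` would need an explicit sort.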




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
