oeuf opened a new pull request #34812:
URL: https://github.com/apache/spark/pull/34812


   ### What changes were proposed in this pull request?
   - Allows underscores in the elements of the `columns` argument and in the column names passed as the `values` argument.
   
   
   ### Why are the changes needed?
   Fixes a bug in `pyspark.pandas.frame.DataFrame.pivot_table` that raises a `KeyError` when one of these names contains an underscore (more details in [SPARK-37553](https://issues.apache.org/jira/browse/SPARK-37553)).
   ```python
   >>> import numpy as np
   >>> import pandas as pd
   
   >>> from pyspark import pandas as ps
   
   >>> pdf = pd.DataFrame(
   ...     {
   ...         "a": [4, 2, 3, 4, 8, 6],
   ...         "b_b": [1, 2, 2, 4, 2, 4],
   ...         "e": [10, 20, 20, 40, 20, 40],
   ...         "c": [1, 2, 9, 4, 7, 4],
   ...         "d": [-1, -2, -3, -4, -5, -6],
   ...     },
   ...     index=np.random.rand(6),
   ... )
   >>> psdf = ps.from_pandas(pdf)
   >>> psdf.pivot_table(index=["c"], columns="a", values=["b_b", "e"])
   
   ---------------------------------------------------------------------------
   KeyError                                  Traceback (most recent call last)
   <ipython-input-8-32d5bb0e1166> in <module>
   ----> 1 psdf.pivot_table(index=["c"], columns="a", values=["b_b", "e"])
   
   
   ~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value)
      6053                     column_labels = [
      6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
   -> 6055                         for name in data_columns
      6056                     ]
      6057                     column_label_names = (
   
   
   ~/.pyenv/versions/3.7.9/envs/venv37/lib/python3.7/site-packages/pyspark/pandas/frame.py in <listcomp>(.0)
      6053                     column_labels = [
      6054                         tuple(list(column_name_to_index[name.split("_")[1]]) + [name.split("_")[0]])
   -> 6055                         for name in data_columns
      6056                     ]
      6057                     column_label_names = (
   
   KeyError: 'b'
   ```
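
The failure in the traceback can be reproduced without Spark. The sketch below assumes each pivoted data column is named `<pivot value>_<values column>` (for example `2_b_b`), which is consistent with the `KeyError: 'b'` shown above. `label_naive` mirrors the expression from the traceback; `label_by_suffix` is a hypothetical alternative for illustration only and is not necessarily how this PR fixes the issue.

```python
# Pure-Python illustration; no Spark session is required.
# Assumption: pivoted data columns are named "<pivot value>_<values column>".
column_name_to_index = {"b_b": ("b_b",), "e": ("e",)}
data_columns = ["2_b_b", "2_e"]


def label_naive(name):
    # Mirrors the expression in the traceback: splitting on every "_"
    # cuts the values column "b_b" apart, so the lookup key becomes "b".
    parts = name.split("_")
    return tuple(list(column_name_to_index[parts[1]]) + [parts[0]])


def label_by_suffix(name):
    # Hypothetical helper: match the known values-column names as a suffix,
    # longest first, and treat everything before it as the pivot value.
    for col in sorted(column_name_to_index, key=len, reverse=True):
        suffix = "_" + col
        if name.endswith(suffix):
            pivot_value = name[: -len(suffix)]
            return tuple(list(column_name_to_index[col]) + [pivot_value])
    raise KeyError(name)


try:
    label_naive("2_b_b")
    naive_error = None
except KeyError as exc:
    naive_error = exc.args[0]  # the missing key, "b"

labels = [label_by_suffix(name) for name in data_columns]
```

Suffix matching also tolerates underscores in the pivot value itself, but it stays ambiguous when one values column name is a suffix of another (for example `b` and `b_b`), so treat it as a sketch of the problem rather than a complete fix.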
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   - [x] Add unit tests for code changes
   - [ ] Build package via GitHub Actions
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


