[GitHub] [spark] Yikun opened a new pull request, #36699: [SPARK-39317][PYTHON][PS] Add explicitly pdf/pser infer when infer schema groupby.apply

GitBox Fri, 27 May 2022 02:36:54 -0700


Yikun opened a new pull request, #36699:
URL: https://github.com/apache/spark/pull/36699


   ### What changes were proposed in this pull request?
   
   Explicitly infer the dtypes of the pdf/pser when inferring the schema in groupby.apply for ``
   
   ### Why are the changes needed?
   
   The root cause of the `TypeError: field B: LongType() can not accept object 2 in type <class 'numpy.int64'>` reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-39317) is that PS doesn't support initializing from a pandas DataFrame with a wrong dtype when Arrow is disabled. See the example below:
   
   ```python
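   import numpy as np
   import pandas as pd
   import pyspark.pandas as ps

   # Object dtype, although every value is a numpy.int64
   df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
   # Fails when Arrow is disabled
   ps.from_pandas(df)
   # Passes
   ps.from_pandas(df.infer_objects())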
   ```
   
   Since this step is only used in PS's schema inference, it processes only a relatively small amount of data. So we can reduce the chance of wrong dtypes by calling `infer_objects` explicitly.
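
As a minimal pandas-only sketch of what the explicit infer does (no Spark involved; the variable names are illustrative and not taken from the patch), `infer_objects` turns the object-dtype column from the example above back into `int64` for both a DataFrame (pdf) and a Series (pser):

```python
import numpy as np
import pandas as pd

# Object-dtype frame whose values are all numpy.int64, as in the example above.
pdf = pd.DataFrame({"B": [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
pser = pdf["B"]

# infer_objects() soft-converts object columns to a more specific dtype.
inferred_pdf = pdf.infer_objects()
inferred_pser = pser.infer_objects()

print(pdf.dtypes["B"], "->", inferred_pdf.dtypes["B"])  # object -> int64
print(pser.dtype, "->", inferred_pser.dtype)            # object -> int64
```

The original objects are left untouched; only the returned copies carry the inferred dtype.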
   
   **Why did it work with pandas < 1.4?** Unfortunately, the behavior of `series.replace({np.nan: None})` changed in pandas 1.4:
   ```python
   # Pandas >= 1.4
   >>> import pandas as pd
   >>> import numpy as np
   >>> df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
   >>> df.replace({np.nan: None}).dtypes
   B    object
   dtype: object
   
   # Pandas < 1.4
   >>> df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
   >>> df.replace({np.nan: None}).dtypes
   B    int64
   dtype: object
   ```
   
   This change affects PS's groupby.apply schema inference, which eventually reaches `replace` through this call chain: [`ps.from_pandas(pser_or_pdf)`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/groupby.py#L1438) --> [`prepare_pandas_frame`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/internal.py#L1469) --> [`replace`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/data_type_ops/base.py#L492).
   
   So the old behavior included an implicit infer; this patch just adds that infer back explicitly.
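
A rough sketch of why adding the infer back should make the result independent of the pandas version: if dtypes are inferred first, the later `replace({np.nan: None})` step sees an `int64` column rather than an object one. The helper name `infer_then_replace` is hypothetical and only mimics the order of the two steps; it is not the code in `groupby.py` or `internal.py`.

```python
import numpy as np
import pandas as pd


def infer_then_replace(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: infer dtypes first, then apply the NaN -> None replace."""
    # Explicit infer, so the result no longer relies on replace() doing it implicitly.
    inferred = pdf.infer_objects()
    # Same replace call as in the PR description, now applied to an int64 column.
    return inferred.replace({np.nan: None})


pdf = pd.DataFrame({"B": [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
result = infer_then_replace(pdf)
print(result.dtypes)
```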
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   - CI
   - Newly added UT passes with pandas 1.4+ and with pandas before 1.4

