Yikun opened a new pull request, #36699: URL: https://github.com/apache/spark/pull/36699
### What changes were proposed in this pull request?

Explicitly infer the dtypes of the pdf/pser when inferring the schema in `groupby.apply`.

### Why are the changes needed?

The root cause of the `TypeError: field B: LongType() can not accept object 2 in type <class 'numpy.int64'>` reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-39317) is that PS doesn't support initializing from a pandas DataFrame with a wrong dtype when Arrow is disabled. See the example below:

```python
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# Integer values stored in a column that is explicitly typed as object
df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
# Failed
ps.from_pandas(df)
# Passed
ps.from_pandas(df.infer_objects())
```

Since this step is only used in PS's schema inference, only a relatively small amount of data is processed. So we can reduce the chance of wrong dtypes by calling `infer_objects` explicitly.

**Why does it work with pandas < 1.4?**

Unfortunately, the behavior of `series.replace({np.nan: None})` changed in pandas 1.4:

```python
# Pandas >= 1.4
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
>>> df.replace({np.nan: None}).dtypes
B    object
dtype: object

# Pandas < 1.4
>>> df = pd.DataFrame({'B': [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
>>> df.replace({np.nan: None}).dtypes
B    int64
dtype: object
```

This change ultimately affects PS's schema inference in `groupby.apply` ([`ps.from_pandas(pser_or_pdf)`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/groupby.py#L1438) --> [`prepare_pandas_frame`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/internal.py#L1469) --> [`replace`](https://github.com/apache/spark/blob/2a7a1b645b649d498d3e0a4d5508b8cd8d0912d2/python/pyspark/pandas/data_type_ops/base.py#L492)). So, before pandas 1.4, this path included an implicit dtype inference; this patch just adds that inference back explicitly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- CI
- Newly added UT passes with pandas 1.4+ and with pandas < 1.4
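
For reference, here is a minimal pandas-only sketch of the dtype inference this PR relies on. It needs no Spark session and is not part of the patch or its unit test. The names `before`, `inferred`, and `after` are illustrative only.

```python
import numpy as np
import pandas as pd

# Integer values stored in an object-dtype column, as in the JIRA reproduction.
df = pd.DataFrame({"B": [np.int64(1), np.int64(2), np.int64(3)]}, dtype="object")
before = df.dtypes["B"]

# infer_objects() attempts a soft conversion of object columns to a better dtype.
inferred = df.infer_objects()
after = inferred.dtypes["B"]

print(before, "->", after)  # object -> int64
```

`infer_objects` returns a new DataFrame and leaves `df` unchanged, so calling it on the small sample used for schema inference should not affect the user's data.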
