ueshin commented on code in PR #54146:
URL: https://github.com/apache/spark/pull/54146#discussion_r2766403000


##########
python/pyspark/pandas/tests/data_type_ops/testing_utils.py:
##########
@@ -219,3 +220,6 @@ def check_extension(self, left, right):
         pandas versions. Please refer to https://github.com/pandas-dev/pandas/issues/39410.
         """
         self.assert_eq(left, right)
+
+    def ignore_null(self, col):
+        return LooseVersion(pd.__version__) >= LooseVersion("3.0") and col == "decimal_nan"

Review Comment:
   The actual issue should be how `decimal.Decimal(np.nan)` is handled?
   
   For the other numeric types, `None` becomes `NaN` when converting to pandas, which is handled well.
   For the other types, `None` stays `None` anyway.
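
   A minimal sketch of the numeric case (assumes a running Spark session; the values are just illustrative):

   ```py
   import numpy as np
   import pyspark.pandas as ps

   # For numeric types, the missing value surfaces as NaN on the pandas side,
   # matching plain pandas behavior, so the comparison works as-is.
   pser = ps.Series([1.0, None]).to_pandas()
   assert np.isnan(pser[1])
   ```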
   
   - The string type is a bit different; previously it was `object` dtype with `None`, but now it's `StringDtype` with `NaN` for null, which was fixed in apache/spark#54015 (illustrated below).
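
   For reference, a rough illustration of that dtype difference (the output depends on the installed pandas version):

   ```py
   import pandas as pd

   s = pd.Series(["a", None])
   print(s.dtype)     # object on pandas < 3.0; the new default string dtype on pandas >= 3.0
   print(repr(s[1]))  # None on pandas < 3.0; nan under the new string dtype
   ```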
   
   But `decimal.Decimal(np.nan)` is a special value that Spark can't handle well anyway?
   
   It will be `None` in pandas API on Spark, as Spark doesn't have a concept of `NaN` for the decimal type.
   
   ```py
   >>> import decimal
   >>> import numpy as np
   >>> import pandas as pd
   >>> import pyspark.pandas as ps
   >>>
   >>> pdf = pd.DataFrame([decimal.Decimal(np.nan), None])
   >>> pdf
         0
   0   NaN
   1  None
   >>>
   >>> psdf = ps.from_pandas(pdf)
   >>> psdf
         0
   0  None
   1  None
   ```
   


