mhferjani opened a new pull request, #53617:
URL: https://github.com/apache/spark/pull/53617

   ### What changes were proposed in this pull request?
   
   This PR fixes the `pyspark.pandas.to_numeric` function to use `DoubleType` 
(float64) instead of `FloatType` (float32) when converting Series to numeric 
types.
   
   **Changes:**
   - Modified the Spark SQL cast type from `FloatType()` to `DoubleType()` in `python/pyspark/pandas/namespace.py` (see the sketch after this list)
   - Updated the docstring examples to reflect the correct `dtype: float64` 
output
   - Added regression tests to verify precision is preserved for large integers
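
   This is not the actual diff in `namespace.py`; it is only a minimal sketch of the kind of change involved, expressed through the public `Series.spark.transform` accessor rather than the internal helper the real code path uses:

   ```python
   import pyspark.pandas as ps
   from pyspark.sql.types import DoubleType

   # Illustrative only: the fix boils down to casting the Series' underlying
   # Spark column to DoubleType (float64) instead of FloatType (float32).
   psser = ps.Series(['1.0', '2', '-3'])
   converted = psser.spark.transform(lambda scol: scol.cast(DoubleType()))
   print(converted)
   # 0    1.0
   # 1    2.0
   # 2   -3.0
   # dtype: float64
   ```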
   
   ### Why are the changes needed?
   
   When `ps.to_numeric()` is called on an integer Series, the function silently downcasts from int64 to float32, causing **precision loss and value corruption**. This diverges from pandas semantics: `pd.to_numeric()` preserves precision, using float64 whenever a float conversion is needed.
   
   **Reproducer:**
   
   ```python
   import pandas as pd
   import pyspark.pandas as ps
   
   # Create a Series with a large integer that exceeds float32 precision
   data = {'c0': [-1554478299, 2]}
   
   # Pandas behavior (correct)
   pd_result = pd.to_numeric(pd.DataFrame(data)['c0'])
   print(pd_result)
   # 0   -1554478299
   # 1             2
   # dtype: int64
   
   # PySpark pandas behavior (buggy - before fix)
   ps_result = ps.to_numeric(ps.DataFrame(data)['c0'])
   print(ps_result)
   # 0   -1554478336.0  <-- Value corrupted! Should be -1554478299
   # 1             2.0
   # dtype: float32
   ```
   
   **Root cause:** float32 has a 24-bit significand (roughly 7 decimal digits of precision), while the value `-1554478299` needs 10 digits, so the cast silently rounds it to the nearest representable float32, `-1554478336`.
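
   The rounding is easy to reproduce with NumPy scalar types (illustrative only; NumPy is used here just to demonstrate float32 behavior and is not part of the fix):

   ```python
   import numpy as np

   value = -1554478299            # magnitude needs 31 bits; float32 carries only a 24-bit significand
   print(int(np.float32(value)))  # -1554478336  (rounded to the nearest representable float32)
   print(int(np.float64(value)))  # -1554478299  (exact; float64 has a 53-bit significand)
   ```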
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The output dtype of `ps.to_numeric()` changes from `float32` to 
`float64` when the input contains string or mixed types that need to be 
converted to float.
   
   **Before:**
   ```python
   >>> ps.to_numeric(ps.Series(['1.0', '2', '-3']))
   0    1.0
   1    2.0
   2   -3.0
   dtype: float32
   ```
   
   **After:**
   ```python
   >>> ps.to_numeric(ps.Series(['1.0', '2', '-3']))
   0    1.0
   1    2.0
   2   -3.0
   dtype: float64
   ```
   
   This change aligns the behavior with pandas `pd.to_numeric()`, which uses float64 by default when a float conversion is needed.
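
   For comparison, the same input through plain pandas:

   ```python
   >>> import pandas as pd
   >>> pd.to_numeric(pd.Series(['1.0', '2', '-3']))
   0    1.0
   1    2.0
   2   -3.0
   dtype: float64
   ```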
   
   ### How was this patch tested?
   
   - Updated existing unit tests in 
`python/pyspark/pandas/tests/test_namespace.py` to expect `float64` instead of 
`float32`
   - Added new regression test specifically for SPARK-54666 to verify large 
integers are preserved without precision loss:
     ```python
     def test_to_numeric_precision_spark_54666(self):
         # Regression test for SPARK-54666: to_numeric should not lose precision
         psser = ps.Series([-1554478299, 2])
         result = ps.to_numeric(psser)
         expected = pd.Series([-1554478299.0, 2.0], dtype='float64')
         self.assert_eq(result, expected)
     ```
   - Ran the full pandas-on-Spark test suite to ensure no regressions
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, this patch was co-authored with Claude (Anthropic).

