[jira] [Resolved] (SPARK-54666) pandas-on-Spark to_numeric silently downcasts int64 to float32, causing precision loss and value corruption

Hyukjin Kwon (Jira) Tue, 24 Feb 2026 14:17:27 -0800


     [ https://issues.apache.org/jira/browse/SPARK-54666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon resolved SPARK-54666.
----------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Issue resolved by pull request 54403
[https://github.com/apache/spark/pull/54403]

> pandas-on-Spark to_numeric silently downcasts int64 to float32, causing 
> precision loss and value corruption
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54666
>                 URL: https://issues.apache.org/jira/browse/SPARK-54666
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform: Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) 
> [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, 
> sharing)
> pyspark 4.0.1
> pandas 2.3.3
> pyarrow 22.0.0
>            Reporter: asddfl
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> When using the pandas API on Spark (pyspark.pandas), calling to_numeric on an 
> integer Series unexpectedly converts the data from int64 to float32.
> This causes silent precision loss and corrupts values: float32 cannot 
> represent every integer whose magnitude exceeds 2^24, so such values are 
> rounded. The behavior diverges from pandas, where to_numeric leaves an 
> int64 Series unchanged.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.pandas as ps
> pd_t0 = pd.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> pd.set_option('display.float_format', lambda x: f"{x:.4f}")
> print("Pandas:")
> result = pd.to_numeric(pd_t0['c0'])
> print(result)
> spark = (
>     SparkSession.builder
>     .config("spark.sql.ansi.enabled", "false")
>     .getOrCreate()
> )
> ps_t0 = ps.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> print("PySpark Pandas:")
> result = ps.to_numeric(ps_t0['c0'])
> print(result)
> {code}
> {code:bash}
> Pandas:
> 0   -1554478299
> 1             2
> Name: c0, dtype: int64
> PySpark Pandas:
> 0   -1554478336.0000
> 1             2.0000
> Name: c0, dtype: float32
> {code}
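
The corrupted value in the output above is consistent with plain float32 rounding. The following is a minimal sketch using only the Python standard library, so it needs no Spark session; the helper name roundtrip_float32 is hypothetical and is not part of pandas or PySpark. It packs the integer as an IEEE 754 single-precision float and reads it back, showing that -1554478299 rounds to -1554478336 while small integers such as 2 survive unchanged.

```python
import struct


def roundtrip_float32(n: int) -> int:
    """Round-trip an integer through IEEE 754 single precision (float32).

    Hypothetical helper for illustration only; it does not call Spark.
    """
    packed = struct.pack("f", float(n))      # store as a 4-byte float32
    (restored,) = struct.unpack("f", packed)  # read it back as a Python float
    return int(restored)


original = -1554478299

# float32 has a 24-bit significand, so not every integer with magnitude
# above 2**24 is representable; this one is rounded to a multiple of 128.
print(roundtrip_float32(original))  # -1554478336
print(roundtrip_float32(2))         # 2
print(abs(original) > 2**24)        # True
```

Between 2^30 and 2^31, adjacent float32 values are 128 apart, which is why the result lands on -1554478336 rather than the original value.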



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
