[jira] [Comment Edited] (SPARK-54666) pandas-on-Spark to_numeric silently downcasts int64 to float32, causing precision loss and value corruption

Vindhya G (Jira) Mon, 09 Feb 2026 22:32:45 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-54666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057506#comment-18057506
 ]


Vindhya G edited comment on SPARK-54666 at 2/10/26 6:30 AM:
------------------------------------------------------------

Running the program above hits this line:
https://github.com/apache/spark/blob/master/python/pyspark/pandas/namespace.py#L3657
I am not sure why the default value is errors="raise" in this function.


was (Author: JIRAUSER299405):
I see this line being hit when I run the above program. 
https://github.com/apache/spark/blob/master/python/pyspark/pandas/namespace.py#L3658
 . Not sure why the default value is error="raise" in this function. 

> pandas-on-Spark to_numeric silently downcasts int64 to float32, causing 
> precision loss and value corruption
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54666
>                 URL: https://issues.apache.org/jira/browse/SPARK-54666
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform: Ubuntu 24.04 
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) 
> [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, 
> sharing)
> pyspark 4.0.1
> pandas 2.3.3
> pyarrow 22.0.0
>            Reporter: asddfl
>            Priority: Critical
>
> When using pandas API on Spark (pyspark.pandas), calling to_numeric on an 
> integer Series unexpectedly downcasts the data from int64 to float32.
> This behavior causes silent precision loss and numeric value corruption, 
> diverging from pandas semantics and violating numeric stability expectations.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.pandas as ps
> pd_t0 = pd.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> pd.set_option('display.float_format', lambda x: f"{x:.4f}")
> print("Pandas:")
> result = pd.to_numeric(pd_t0['c0'])
> print(result)
> spark = (
>     SparkSession.builder
>     .config("spark.sql.ansi.enabled", "false")
>     .getOrCreate()
> )
> ps_t0 = ps.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> print("PySpark Pandas:")
> result = ps.to_numeric(ps_t0['c0'])
> print(result)
> {code}
> {code:bash}
> Pandas:
> 0   -1554478299
> 1             2
> Name: c0, dtype: int64
> PySpark Pandas:
> 0   -1554478336.0000
> 1             2.0000
> Name: c0, dtype: float32
> {code}
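
For reference, the corrupted value in the output above is what rounding to float32 precision would produce. The sketch below uses NumPy only and does not call pyspark, so it illustrates the arithmetic rather than the code path inside to_numeric; the variable names are illustrative and not taken from the Spark source.

```python
import numpy as np

# The value from the reproduction in the issue description.
value = np.int64(-1554478299)

# float32 keeps a 24-bit significand, so integers of this magnitude
# (between 2**30 and 2**31) can only be represented in steps of 128.
as_f32 = np.float32(value)
gap = np.spacing(np.float32(abs(value)))

# float64 keeps a 53-bit significand and holds this value exactly.
as_f64 = np.float64(value)

round_trip32 = int(as_f32)
round_trip64 = int(as_f64)

print("original        :", int(value))
print("via float32     :", round_trip32)
print("via float64     :", round_trip64)
print("float32 spacing :", float(gap))
```

The float32 round trip gives -1554478336, the same value shown in the PySpark Pandas output, while the float64 round trip returns the original integer.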



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
