[ https://issues.apache.org/jira/browse/SPARK-54666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-54666.
----------------------------------
Fix Version/s: 4.2.0
Resolution: Fixed
Issue resolved by pull request 54403
[https://github.com/apache/spark/pull/54403]
> pandas-on-Spark to_numeric silently downcasts int64 to float32, causing
> precision loss and value corruption
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-54666
> URL: https://issues.apache.org/jira/browse/SPARK-54666
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10)
> [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> When using the pandas API on Spark (pyspark.pandas), calling to_numeric on an
> integer Series unexpectedly converts the data from int64 to float32.
> float32 cannot represent every integer above 2^24 exactly, so large values
> are silently rounded. This diverges from pandas, which keeps the int64 dtype.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.pandas as ps
>
> pd_t0 = pd.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> pd.set_option('display.float_format', lambda x: f"{x:.4f}")
> print("Pandas:")
> result = pd.to_numeric(pd_t0['c0'])
> print(result)
>
> spark = (
>     SparkSession.builder
>     .config("spark.sql.ansi.enabled", "false")
>     .getOrCreate()
> )
> ps_t0 = ps.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> print("PySpark Pandas:")
> result = ps.to_numeric(ps_t0['c0'])
> print(result)
> {code}
> {code:bash}
> Pandas:
> 0   -1554478299
> 1             2
> Name: c0, dtype: int64
> PySpark Pandas:
> 0   -1554478336.0000
> 1             2.0000
> Name: c0, dtype: float32
> {code}
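
The corrupted value in the output above is consistent with plain IEEE 754
single-precision rounding. The following is a minimal sketch, using only the
Python standard library and not Spark itself, that reproduces the same number.
The helper name round_trip_float32 is hypothetical and is not part of pyspark
or pandas.

```python
import struct


def round_trip_float32(value: int) -> float:
    """Round an integer to the nearest float32 and return it as a Python float.

    Hypothetical helper for illustration only: packing with the "f" format
    stores the value as a 4-byte IEEE 754 single, which is where the
    rounding happens.
    """
    packed = struct.pack("<f", float(value))
    return struct.unpack("<f", packed)[0]


original = -1554478299
rounded = round_trip_float32(original)

# Between 2**30 and 2**31 adjacent float32 values are 128 apart,
# so the original integer is rounded to the nearest multiple of 128.
print(f"{rounded:.4f}")            # -1554478336.0000
print(int(rounded) - original)     # -37

# Small integers (up to 2**24 in magnitude) survive the round trip unchanged.
print(round_trip_float32(2))       # 2.0
```

The rounded value matches the -1554478336.0000 reported for ps.to_numeric,
which suggests the loss comes from the cast to float32 rather than from
anything specific to the input data.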