[
https://issues.apache.org/jira/browse/SPARK-54666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057506#comment-18057506
]
Vindhya G edited comment on SPARK-54666 at 2/10/26 6:30 AM:
------------------------------------------------------------
I see this line being hit when I run the above program:
https://github.com/apache/spark/blob/master/python/pyspark/pandas/namespace.py#L3657
Not sure why the default value is errors="raise" in this function.
was (Author: JIRAUSER299405):
I see this line being hit when I run the above program.
https://github.com/apache/spark/blob/master/python/pyspark/pandas/namespace.py#L3658
Not sure why the default value is error="raise" in this function.
> pandas-on-Spark to_numeric silently downcasts int64 to float32, causing
> precision loss and value corruption
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-54666
> URL: https://issues.apache.org/jira/browse/SPARK-54666
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.1
> Environment: Platform: Ubuntu 24.04
> Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10)
> [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode,
> sharing)
> pyspark 4.0.1
> pandas 2.3.3
> pyarrow 22.0.0
> Reporter: asddfl
> Priority: Critical
>
> When using pandas API on Spark (pyspark.pandas), calling to_numeric on an
> integer Series unexpectedly downcasts the data from int64 to float32.
> This behavior causes silent precision loss and numeric value corruption,
> diverging from pandas semantics and violating numeric stability expectations.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.pandas as ps
>
> pd_t0 = pd.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> pd.set_option('display.float_format', lambda x: f"{x:.4f}")
> print("Pandas:")
> result = pd.to_numeric(pd_t0['c0'])
> print(result)
>
> spark = (
>     SparkSession.builder
>     .config("spark.sql.ansi.enabled", "false")
>     .getOrCreate()
> )
> ps_t0 = ps.DataFrame(
>     {
>         'c0': [-1554478299, 2]
>     }
> )
> print("PySpark Pandas:")
> result = ps.to_numeric(ps_t0['c0'])
> print(result)
> {code}
> {code:bash}
> Pandas:
> 0   -1554478299
> 1             2
> Name: c0, dtype: int64
> PySpark Pandas:
> 0   -1554478336.0000
> 1             2.0000
> Name: c0, dtype: float32
> {code}
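The corrupted value in the output above is consistent with plain float32 rounding, and that can be checked without Spark. The following is a minimal sketch using NumPy only. It assumes nothing about how to_numeric is implemented internally and only shows the arithmetic of storing the reported integer in a 32-bit float.

```python
import numpy as np

# The value from the report that comes back changed.
value = -1554478299

# float32 has a 24-bit significand, so integers with magnitude in
# [2**30, 2**31) can only be represented in steps of 2**(30 - 23) = 128.
as_f32 = np.float32(value)
spacing = np.spacing(np.float32(abs(value)))

print(int(as_f32))            # nearest representable float32 value
print(float(spacing))         # gap between adjacent float32 values here
print(int(as_f32) - value)    # rounding error introduced by the cast

# float64 (and int64) hold the value exactly, so no corruption occurs there.
as_f64 = np.float64(value)
print(int(as_f64) == value)
```

The float32 result is -1554478336, which matches the pandas-on-Spark output in the report, while the float64 round trip is exact. This suggests that the loss comes from a 32-bit float conversion somewhere in the call path rather than from the input data.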