[jira] [Commented] (SPARK-54665) pandas-on-Spark Boolean vs String comparison yields inconsistent result with pandas

Tian Gao (Jira) Mon, 15 Dec 2025 11:47:25 -0800


    [ https://issues.apache.org/jira/browse/SPARK-54665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045267#comment-18045267 ]


Tian Gao commented on SPARK-54665:
----------------------------------

I think this is by design? You explicitly turned off ANSI mode, and with 
ANSI mode off Spark itself applies the implicit conversion when comparing a 
boolean with a string. I don't think this is a bug.
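
To make the difference concrete, here is a rough sketch in plain Python. It 
is not Spark's actual code: the helper names are hypothetical, and the sets 
of accepted literals are my assumption about what a lenient (non-ANSI) 
string-to-boolean cast accepts. It only models the idea that one side 
converts the string before comparing and the other does not.

```python
from typing import Optional

# Assumption: literals a lenient string-to-boolean cast would accept.
# These sets are illustrative, not read from Spark at runtime.
TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
FALSE_STRINGS = {"f", "false", "n", "no", "0"}


def cast_string_to_boolean(value: str) -> Optional[bool]:
    """Hypothetical helper: lenient cast that yields None (NULL) on failure."""
    normalized = value.strip().lower()
    if normalized in TRUE_STRINGS:
        return True
    if normalized in FALSE_STRINGS:
        return False
    return None  # unparseable input becomes NULL instead of raising


def coerced_equals(left: bool, right: str) -> Optional[bool]:
    """Model of comparison with implicit conversion: cast the string first."""
    right_as_bool = cast_string_to_boolean(right)
    if right_as_bool is None:
        return None  # a NULL operand makes the comparison NULL
    return left == right_as_bool


def strict_equals(left: bool, right: str) -> bool:
    """Model of comparison without conversion: a bool never equals a str."""
    return left == right


print("with implicit conversion:", coerced_equals(True, "True"))
print("without conversion:      ", strict_equals(True, "True"))
```

Under these assumptions the first call gives True and the second gives 
False, which matches the two outputs in the report.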

> pandas-on-Spark Boolean vs String comparison yields inconsistent result with 
> pandas
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-54665
>                 URL: https://issues.apache.org/jira/browse/SPARK-54665
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.1
>         Environment: Platform: Ubuntu 24.04 Linux-6.14.0-35-generic-x86_64-with-glibc2.39
> Python: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:29:10) [GCC 14.3.0]
> openjdk version "17.0.17-internal" 2025-10-21
> OpenJDK Runtime Environment (build 17.0.17-internal+0-adhoc..src)
> OpenJDK 64-Bit Server VM (build 17.0.17-internal+0-adhoc..src, mixed mode, sharing)
> pyspark 4.0.1
> pandas 2.3.3
> pyarrow 22.0.0
>            Reporter: asddfl
>            Priority: Critical
>
> When using pandas-on-Spark (pyspark.pandas / pandas API on Spark), comparing 
> a boolean Series with a string literal produces a result that is inconsistent 
> with native pandas.
> This behavior diverges from pandas semantics and may cause silent logic 
> differences when running pandas-compatible code on Spark.
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.pandas as ps
> pd_t1 = pd.DataFrame(
>     {
>         'c1': [True]
>     }
> )
> print("Pandas:")
> print(pd_t1['c1'] == 'True')
> spark = (
>     SparkSession.builder
>     .config("spark.sql.ansi.enabled", "false")
>     .getOrCreate()
> )
> ps_t1 = ps.DataFrame(
>     {
>         'c1': [True]
>     }
> )
> print("PySpark Pandas:")
> print(ps_t1['c1'] == 'True')
> {code}
> {code:bash}
> Pandas:
> 0    False
> Name: c1, dtype: bool
> PySpark Pandas:
> 0    True
> Name: c1, dtype: bool
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

