Devin Petersohn created SPARK-55977:
---------------------------------------

             Summary: isin() should not match values of incompatible types
                 Key: SPARK-55977
                 URL: https://issues.apache.org/jira/browse/SPARK-55977
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 4.1.1
            Reporter: Devin Petersohn


In pandas API on Spark, DataFrame.isin() and Series.isin() return True when
comparing values of incompatible types. Spark's implicit type coercion causes
the string "1" to match the integer 1, whereas pandas does not coerce between
strings and numbers and returns False.

import pandas as pd
import pyspark.pandas as ps

# DataFrame.isin with list
pdf = pd.DataFrame({"a": [1, 2, 3]})
psdf = ps.from_pandas(pdf)
pdf.isin(["1", "2"])["a"].tolist()   # [False, False, False]
psdf.isin(["1", "2"])["a"].tolist()  # [True, True, False]

# DataFrame.isin with dict
pdf.isin({"a": ["1", "2"]})["a"].tolist()   # [False, False, False]
psdf.isin({"a": ["1", "2"]})["a"].tolist()  # [True, True, False]

# Series.isin
pd.Series([1, 2, 3]).isin(["1", "2"]).tolist()  # [False, False, False]
ps.Series([1, 2, 3]).isin(["1", "2"]).tolist()  # [True, True, False]

# Numeric cross-type works correctly in both (int col, float values)
pd.Series([1, 2, 3]).isin([1.0, 2.0]).tolist()  # [True, True, False]
ps.Series([1, 2, 3]).isin([1.0, 2.0]).tolist()  # [True, True, False]
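
Until this is fixed, one possible caller-side workaround is to drop candidate values whose kind (numeric vs. string) does not occur in the column before calling isin(). The sketch below is pure Python and only models the expected matching rule. The helper names compatible_candidates and strict_isin are hypothetical and not part of pandas or pyspark.pandas, and passing the filtered list on to psdf.isin() has not been verified against Spark here.

```python
from numbers import Number


def _kind(value):
    """Classify a value coarsely: bool, numeric, string, or its type name."""
    # bool is a subclass of int in Python, so check it before Number.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, Number):
        return "numeric"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def compatible_candidates(column_values, candidates):
    """Keep only candidates whose kind occurs among the column values.

    Hypothetical helper: int and float both count as "numeric", so
    int-vs-float matching is preserved while strings are dropped.
    """
    column_kinds = {_kind(v) for v in column_values}
    return [c for c in candidates if _kind(c) in column_kinds]


def strict_isin(column_values, candidates):
    """Pure-Python model of the matching the issue expects (no coercion)."""
    allowed = compatible_candidates(column_values, candidates)
    return [any(v == c for c in allowed) for v in column_values]


print(strict_isin([1, 2, 3], ["1", "2"]))   # [False, False, False]
print(strict_isin([1, 2, 3], [1.0, 2.0]))   # [True, True, False]
```

Grouping int and float under one "numeric" kind keeps the cross-type case from the report working, while string candidates never reach the comparison against a numeric column.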



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

