[
https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-42905:
-----------------------------------
Labels: correctness pull-request-available (was: correctness)
> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect
> and inconsistent results for the same DataFrame if it has huge amount of Ties.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-42905
> URL: https://issues.apache.org/jira/browse/SPARK-42905
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.3.0
> Reporter: dronzer
> Priority: Critical
> Labels: correctness, pull-request-available
> Attachments: image-2023-03-23-10-51-28-420.png,
> image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png,
> image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario in which the Correlation function fails to give
> correct Spearman coefficient results.
> Tested example: a Spark DataFrame with two columns, A and B.
> !image-2023-03-23-10-55-26-879.png|width=562,height=162!
> Column A has 3 distinct values across a total of 108 million rows.
> Column B has 4 distinct values across a total of 108 million rows.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it
> gives the correct answer, and even if I run the same code multiple times the
> same answer is produced. (Each column has only 3-4 distinct values.)
> !image-2023-03-23-10-53-37-461.png|width=468,height=287!
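> The pandas behaviour can be sketched with a small stand-in for the data (a
> hypothetical miniature of the real 108M-row frame, keeping the same 3-4
> distinct values per column):

```python
import pandas as pd

# Miniature stand-in for the real data: heavy ties, only a few
# distinct values per column (the real frame has 108M rows).
df = pd.DataFrame({
    "A": [1, 2, 3] * 40000,     # 3 distinct values, 120k rows
    "B": [1, 2, 3, 4] * 30000,  # 4 distinct values, 120k rows
})

# pandas ranks ties deterministically (average ranks), so repeated
# runs always return the same Spearman coefficient.
r1 = df.corr(method="spearman").loc["A", "B"]
r2 = df.corr(method="spearman").loc["A", "B"]
assert r1 == r2
```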
>
> In Spark, however, the Spearman correlation produces *different results* for
> the *same dataframe* on multiple runs (see below; each column in this
> DataFrame has only 3-4 distinct values).
> !image-2023-03-23-10-52-49-392.png|width=516,height=322!
>
> In short: pandas DF.corr gives the same result on the same DataFrame across
> multiple runs, which is the expected behaviour. Spark, using the same data,
> not only gives a different result, but running the same cell with the same
> data multiple times produces different results each time, meaning the output
> is inconsistent.
> Looking at the data, the only observation I could draw is the ties in it
> (only 3-4 distinct values over 108M rows). This scenario does not appear to
> be handled by Spark's Correlation method, as the same data produces
> consistent results in Python via df.corr.
> The only workaround we could find that gives consistent output, matching the
> Python result, is to use a Pandas UDF in Spark, as shown below:
> !image-2023-03-23-10-52-11-481.png|width=518,height=111!
> !image-2023-03-23-10-51-28-420.png|width=509,height=270!
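> A sketch of the workaround (reconstructed from the screenshots above, so the
> function name and schema string are assumptions): the correlation is computed
> inside a plain pandas function, which applyInPandas runs over the whole frame
> as a single group. Note that this pulls all rows into one pandas DataFrame on
> a single worker, trading memory for pandas' deterministic tie handling.

```python
import pandas as pd

def spearman_corr(pdf: pd.DataFrame) -> pd.DataFrame:
    """Compute the Spearman coefficient of columns A and B in pandas.

    Runs entirely in pandas, so it inherits pandas' tie handling
    (average ranks) and is deterministic across runs.
    """
    rho = pdf["A"].corr(pdf["B"], method="spearman")
    return pd.DataFrame({"spearman": [rho]})

# Hypothetical Spark wiring (assumes a running SparkSession and a
# DataFrame `df` with numeric columns A and B):
#
#   result = (df.groupBy()  # no keys: the whole frame is one group
#               .applyInPandas(spearman_corr, schema="spearman double"))
#   result.show()
```

> An alternative with the same effect (and the same memory cost) is simply
> df.select("A", "B").toPandas().corr(method="spearman").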
>
> We also tried the pyspark.pandas.DataFrame.corr method, and it produces
> incorrect and inconsistent results for this case too.
> Only the Pandas UDF approach seems to provide consistent results.
>
> Another point to note: if I add some random noise to the data, which in turn
> increases the number of distinct values, the results become consistent again
> across runs. This makes me believe that the Python version handles ties
> correctly and gives consistent results no matter how many ties exist, whereas
> the pyspark method is somehow not able to handle many ties in the data.
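> For reference, the tie handling that makes the pandas result deterministic
> is average ranking: tied values all receive the mean of the rank positions
> they occupy, and the Spearman coefficient is then the Pearson correlation of
> those ranks. A minimal illustration:

```python
import pandas as pd

a = pd.Series([1, 1, 2, 3, 3, 3])
# Tied values share the mean of the positions they would occupy when
# sorted: the two 1s occupy positions 1-2 -> rank 1.5 each; the three
# 3s occupy positions 4-6 -> rank 5.0 each.
print(a.rank(method="average").tolist())  # [1.5, 1.5, 3.0, 5.0, 5.0, 5.0]

# Spearman = Pearson correlation of the average ranks.
b = pd.Series([2, 1, 1, 3, 2, 3])
rho = a.corr(b, method="spearman")
rho_manual = a.rank(method="average").corr(b.rank(method="average"))
assert abs(rho - rho_manual) < 1e-12
```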
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]