[
https://issues.apache.org/jira/browse/SPARK-45834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiayi Liu updated SPARK-45834:
------------------------------
Description:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, the product
{{xMk * yMk}} can underflow to 0 in double arithmetic, which makes the
denominator 0. The calculation then returns Infinity.
For example, computing the correlation between the identical columns a and b in
the following table yields Infinity, although the correlation of identical
columns should be 1.0.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|
Changing the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} solves this issue and
improves the numerical stability of the calculation. The modified formula takes
the square roots {{sqrt(xMk)}} and {{sqrt(yMk)}} separately, so the product
{{xMk * yMk}} is never formed. This avoids overflow of the product for very
large values and underflow to zero for extremely small values.
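The effect can be reproduced outside Spark with plain double arithmetic. The
following is a minimal sketch, not Spark's implementation: it assumes that
{{xMk}} and {{yMk}} are the sums of squared deviations of each column and that
{{ck}} is their co-moment. The helper names (co_moments, ieee_div, corr_naive,
corr_stable) are hypothetical, and ieee_div only mimics JVM double division,
because Python raises an exception on division by zero.

```python
import math


def co_moments(xs, ys):
    """Return (ck, x_mk, y_mk): co-moment and sums of squared deviations.

    Assumption: this two-pass computation stands in for the values Spark
    accumulates; it is not Spark's actual update logic.
    """
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    ck = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    x_mk = sum((x - x_avg) * (x - x_avg) for x in xs)
    y_mk = sum((y - y_avg) * (y - y_avg) for y in ys)
    return ck, x_mk, y_mk


def ieee_div(num, den):
    """Divide like a JVM double for a non-negative denominator.

    x / 0.0 gives +/-Infinity, and 0.0 / 0.0 gives NaN, instead of raising.
    """
    if den == 0.0:
        if num == 0.0 or math.isnan(num):
            return math.nan
        return math.copysign(math.inf, num)
    return num / den


def corr_naive(ck, x_mk, y_mk):
    # Original formula: the product x_mk * y_mk can underflow to 0.0.
    return ieee_div(ck, math.sqrt(x_mk * y_mk))


def corr_stable(ck, x_mk, y_mk):
    # Proposed formula: take each square root first, then divide twice.
    return ieee_div(ieee_div(ck, math.sqrt(x_mk)), math.sqrt(y_mk))


if __name__ == "__main__":
    a = [1e-200, 1e-200, 1e-100]  # column a from the table; column b is identical
    ck, x_mk, y_mk = co_moments(a, a)
    print(x_mk * y_mk)                 # 0.0, the product underflows
    print(corr_naive(ck, x_mk, y_mk))  # inf
    print(corr_stable(ck, x_mk, y_mk)) # very close to 1.0
```

Each of {{xMk}} and {{yMk}} is on the order of 1e-201 here, which a double can
represent, but their product is far below the smallest positive double, so only
the combined form loses the value.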
was:
Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson
Correlation Coefficient. If {{xMk}} and {{yMk}} are very small, it can lead to
double multiplication overflow, resulting in a denominator of 0. This leads to
a NaN result in the calculation.
For example, when calculating the correlation for the same columns a and b in a
table, the result will be Infinity, but the correlation for identical columns
should be 1.0 instead.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|
Modifying the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} can indeed solve this
issue and improve the stability of the calculation. The benefit of this
modification is that it splits the square root of the denominator into two
parts: {{sqrt(xMk)}} and {{sqrt(yMk)}}. This helps avoid multiplication
overflow or cases where the product of extremely small values becomes zero.
> Fix Pearson correlation calculation more stable
> -----------------------------------------------
>
> Key: SPARK-45834
> URL: https://issues.apache.org/jira/browse/SPARK-45834
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Jiayi Liu
> Priority: Major
>