[ 
https://issues.apache.org/jira/browse/SPARK-53948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-53948:
--------------------------------
    Description: 
The Observation class evolved several times between Spark 3.5 and Spark 4.0.0.
Previously, it used a locking mechanism (synchronized) between the get and
onFinish methods to coordinate updating and retrieving metrics.

But this mechanism has a potential deadlock bug. If get is called before
ObservationListener is triggered to call onFinish, get waits forever for the
metrics, because it holds the lock on the Observation object via synchronized,
so the later onFinish call is locked out and cannot update the metrics.
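
A minimal Java sketch of the lock-out described above. This is not the Spark
3.5 code: LockedObservation, listenerIsLockedOut, the latch, and the timeout
are all hypothetical and exist only so the example is deterministic and
terminates. It assumes get keeps the monitor for its whole wait, as the
description states.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class LockedObservationSketch {

    // Hypothetical stand-in for an Observation guarded by synchronized.
    static final class LockedObservation {
        private final CountDownLatch getEntered = new CountDownLatch(1);
        private Map<String, Object> metrics; // null until onFinish succeeds

        // Keeps the monitor for the whole wait: Thread.sleep does not release
        // it. The timeout is only here so the demo ends; without it, get
        // would wait forever.
        synchronized Map<String, Object> get(long timeoutMillis)
                throws InterruptedException {
            getEntered.countDown(); // signal that get now owns the monitor
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (metrics == null && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            return metrics;
        }

        // Needs the same monitor, so it blocks while get is still waiting.
        synchronized void onFinish(Map<String, Object> observed) {
            if (metrics == null) {
                metrics = observed;
            }
        }

        void awaitGetEntered() throws InterruptedException {
            getEntered.await();
        }
    }

    // True when the first get timed out with no metrics even though onFinish
    // was invoked while it was waiting, and a later get sees the metrics.
    public static boolean listenerIsLockedOut() {
        LockedObservation obs = new LockedObservation();
        Thread listener = new Thread(() -> {
            try {
                obs.awaitGetEntered();             // get already holds the lock
                obs.onFinish(Map.of("rows", 42L)); // blocks until get returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        listener.start();
        try {
            Map<String, Object> first = obs.get(300);  // times out empty
            listener.join();                           // onFinish runs now
            Map<String, Object> second = obs.get(300); // returns immediately
            return first == null && second != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("listener locked out: " + listenerIsLockedOut());
    }
}
```

In the sketch, the listener can only deliver the metrics after get gives up and
releases the monitor. With no timeout in get, neither call would ever finish.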

This locking mechanism was replaced with a promise in SPARK-49423, which is a
large refactoring of the observation feature. But the PR does not mention the
deadlock bug, and no bug-fix PR was proposed for earlier versions. So I think
the bug was not known and the fix in Spark 4.0.0 was unintentional. The bug is
still present in the Spark 3.5 branch.
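
For comparison, a minimal Java sketch of the promise-based shape. It uses
java.util.concurrent.CompletableFuture as a stand-in for a promise and is not
the SPARK-49423 implementation; PromiseObservation and getThenFinish are
hypothetical names. The point is only that get blocks on the future rather
than on a monitor shared with onFinish.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PromiseObservationSketch {

    // Hypothetical stand-in for an Observation backed by a promise.
    static final class PromiseObservation {
        private final CompletableFuture<Map<String, Object>> promise =
                new CompletableFuture<>();

        // Waits on the future; no monitor is held while waiting.
        Map<String, Object> get(long timeoutMillis)
                throws InterruptedException, ExecutionException, TimeoutException {
            return promise.get(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        // Only the first completion takes effect; later calls return false.
        boolean onFinish(Map<String, Object> observed) {
            return promise.complete(observed);
        }
    }

    // Calls get while the listener is delayed, then returns the "rows" metric.
    public static Object getThenFinish() {
        PromiseObservation obs = new PromiseObservation();
        Thread listener = new Thread(() -> {
            try {
                Thread.sleep(100); // get is very likely already waiting by now
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            obs.onFinish(Map.of("rows", 42L));
        });
        listener.start();
        try {
            Map<String, Object> metrics = obs.get(5_000);
            listener.join();
            return metrics.get("rows");
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rows = " + getThenFinish());
    }
}
```

Whichever of get and onFinish runs first, the result is the same, because
completing the future never has to wait for a lock held by the waiting caller.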

  was:
The Observation class evolved several times between Spark 3.5 and Spark 4.0.0.
Previously, it used a locking mechanism (synchronized) between the get and
onFinish methods to coordinate updating and retrieving metrics.

But this mechanism has a potential deadlock bug. If get is called before
ObservationListener is triggered to call onFinish, get waits forever for the
metrics, because it holds the lock on the Observation object via synchronized,
so the later onFinish call is locked out and cannot update the metrics.

This locking mechanism was replaced with a promise in SPARK-49423. But the PR
does not mention the deadlock bug, and no bug-fix PR was proposed for earlier
versions. So I think the bug was not known and the fix in Spark 4.0.0 was
unintentional. The bug is still present in the Spark 3.5 branch.


> Fix deadlock in Observation
> ---------------------------
>
>                 Key: SPARK-53948
>                 URL: https://issues.apache.org/jira/browse/SPARK-53948
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.7
>            Reporter: L. C. Hsieh
>            Priority: Major
>
> The Observation class evolved several times between Spark 3.5 and Spark
> 4.0.0. Previously, it used a locking mechanism (synchronized) between the
> get and onFinish methods to coordinate updating and retrieving metrics.
> But this mechanism has a potential deadlock bug. If get is called before
> ObservationListener is triggered to call onFinish, get waits forever for
> the metrics, because it holds the lock on the Observation object via
> synchronized, so the later onFinish call is locked out and cannot update
> the metrics.
> This locking mechanism was replaced with a promise in SPARK-49423, which is
> a large refactoring of the observation feature. But the PR does not mention
> the deadlock bug, and no bug-fix PR was proposed for earlier versions. So I
> think the bug was not known and the fix in Spark 4.0.0 was unintentional.
> The bug is still present in the Spark 3.5 branch.


