[
https://issues.apache.org/jira/browse/SPARK-53948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
L. C. Hsieh updated SPARK-53948:
--------------------------------
Description:
The Observation class evolved several times between Spark 3.5 and Spark 4.0.0.
Previously, it used a locking mechanism (synchronized) between the get and
onFinish methods to coordinate metrics update and retrieval.
However, this mechanism has a potential deadlock. If get is called before
ObservationListener is triggered to call onFinish, get waits forever for the
metrics: it holds the lock on the observation object via synchronized, so the
later onFinish call is locked out and cannot update the metrics.
SPARK-49423, a large refactoring of the observation feature, replaced this
locking mechanism with a promise. However, that PR does not mention the
deadlock, and no bug-fix PR was proposed for earlier versions. So I think the
bug was not known and the fix in Spark 4.0.0 was unintentional. The bug is
still present in the Spark 3.5 branch.
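
For illustration only, the promise-based handoff between get and onFinish
can be sketched in Java as below. This is not the code from SPARK-49423: the
class name PromiseObservationSketch and the "rows" metric are hypothetical, and
java.util.concurrent.CompletableFuture stands in for the promise. The point is
that the reader waits on the promise without holding any lock, so the
callback can always complete it.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an observation object; not Spark's actual class.
final class PromiseObservationSketch {
    // Completed at most once, by the listener callback.
    private final CompletableFuture<Map<String, Object>> promise =
        new CompletableFuture<>();

    // Blocks until metrics arrive. No monitor is held while waiting,
    // so a concurrent onFinish call is never locked out.
    public Map<String, Object> get() {
        return promise.join();
    }

    // Called by a (hypothetical) listener when the query finishes.
    // Returns true only for the call that actually set the metrics.
    public boolean onFinish(Map<String, Object> metrics) {
        return promise.complete(Map.copyOf(metrics));
    }

    public static void main(String[] args) throws Exception {
        PromiseObservationSketch obs = new PromiseObservationSketch();
        // Start the reader first: get is called before onFinish,
        // the ordering the issue describes as problematic.
        CompletableFuture<Map<String, Object>> reader =
            CompletableFuture.supplyAsync(obs::get);
        Thread.sleep(100); // give the reader time to block in get
        obs.onFinish(Map.of("rows", 42L));
        // The reader is released once the promise is completed.
        System.out.println(reader.get(5, TimeUnit.SECONDS));
    }
}
```

In this sketch a second onFinish call is ignored, because a CompletableFuture
can be completed only once; whether that matches the intended semantics of
the real class is an assumption.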
> Fix deadlock in Observation
> ---------------------------
>
> Key: SPARK-53948
> URL: https://issues.apache.org/jira/browse/SPARK-53948
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.7
> Reporter: L. C. Hsieh
> Priority: Major