[
https://issues.apache.org/jira/browse/SPARK-53900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Venkata Sai Akhil Gudesa updated SPARK-53900:
---------------------------------------------
Description:
A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely when
the configured deadline passes. The bug was introduced in
[PR|https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]
during migration from System.currentTimeMillis() to System.nanoTime(), where
an integer division error converts sub-millisecond timeout values to 0,
triggering Java's wait(0) behavior (infinite wait).
h2. Root Cause
executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS) // ← BUG
*The Problem*: When deadlineTimeNs < System.nanoTime() (the deadline has
passed):
# Math.max(1, negative_value) clamps to 1 nanosecond
# Math.min(progressInterval_ns, 1) remains 1 nanosecond
# Integer division: 1 / 1,000,000 = 0 milliseconds
# wait(0) in Java means *wait indefinitely until notified*
# No notification arrives (execution already completed), so the thread hangs forever
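
The arithmetic in the steps above can be reproduced in isolation. The following is a minimal sketch, not the actual ExecuteGrpcResponseSender code: the class name, method names and the 2-second progress interval are illustrative assumptions, and only NANOS_PER_MILLIS and the max/min/divide sequence are taken from the description.

```java
final class WaitTimeoutArithmetic {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical helper mirroring the calculation described above:
    // clamp the remaining time to at least 1 ns, then cap it at the progress interval.
    static long timeoutNs(long deadlineTimeNs, long nowNs, long progressIntervalNs) {
        long untilDeadlineNs = Math.max(1L, deadlineTimeNs - nowNs);
        return Math.min(progressIntervalNs, untilDeadlineNs);
    }

    // The unpatched conversion: long division truncates toward zero,
    // so every timeout below 1,000,000 ns becomes 0 ms.
    static long buggyWaitMillis(long timeoutNs) {
        return timeoutNs / NANOS_PER_MILLIS;
    }

    public static void main(String[] args) {
        long nowNs = System.nanoTime();
        long deadlineTimeNs = nowNs - 5 * NANOS_PER_MILLIS; // deadline passed 5 ms ago
        long progressIntervalNs = 2_000 * NANOS_PER_MILLIS; // assumed 2 s interval

        long t = timeoutNs(deadlineTimeNs, nowNs, progressIntervalNs);
        System.out.println("timeoutNs   = " + t);                   // prints 1
        System.out.println("wait(ms) arg = " + buggyWaitMillis(t)); // prints 0
    }
}
```

Note that the truncation is not specific to the 1 ns case: any timeoutNs below 1,000,000 yields a wait(0) call.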
While one of the loop conditions guards against deadlineTimeNs <
System.nanoTime(), it isn't sufficient, as the deadline can elapse while inside
the loop (the time is fetched afresh in the later timeout calculation). The
probability of occurrence can be exacerbated by GC pauses.
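
The window between the two clock reads can be illustrated with an injected clock. This is a simplified sketch under stated assumptions: the class, the oneIteration and readings helpers, and the -1 sentinel are hypothetical and do not appear in Spark; the real loop has more conditions than the single guard modelled here.

```java
import java.util.function.LongSupplier;

final class DeadlineRaceSketch {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Models one pass of the loop. Returns the millisecond value the unpatched
    // code would pass to wait(), or -1 if the guard ends the loop first.
    static long oneIteration(LongSupplier clock, long deadlineTimeNs, long progressIntervalNs) {
        // Loop guard: first clock read.
        if (deadlineTimeNs < clock.getAsLong()) {
            return -1L;
        }
        // A GC pause or scheduling delay can land here, after the guard passed.
        // Timeout calculation: second, fresh clock read.
        long timeoutNs =
            Math.min(progressIntervalNs, Math.max(1L, deadlineTimeNs - clock.getAsLong()));
        return timeoutNs / NANOS_PER_MILLIS;
    }

    // Fake clock that returns the given readings in order (illustrative only).
    static LongSupplier readings(long... values) {
        int[] next = {0};
        return () -> values[next[0]++];
    }

    public static void main(String[] args) {
        long deadlineTimeNs = 100L;
        // Guard reads 90 (deadline not yet passed), calculation reads 110 (passed).
        long ms = oneIteration(readings(90L, 110L), deadlineTimeNs, 2_000_000_000L);
        System.out.println("wait(ms) arg = " + ms); // prints 0
    }
}
```

With both readings before the deadline the result is a normal positive timeout, and with the first reading after the deadline the guard exits; only the straddling case produces 0.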
h2. Conditions Required for Bug to Trigger
The bug manifests when *all* of the following conditions are met:
# *Reattachable execution enabled* (CONNECT_EXECUTE_REATTACHABLE_ENABLED = true)
# *Execution completes prior* to the deadline within the inner loop (all responses sent before deadline)
# *Deadline passes* within the inner loop
h2. Proposed fix
Ensure the millisecond value passed to wait() is always at least 1, so a sub-millisecond timeoutNs can never truncate to wait(0).
executionObserver.responseLock.wait(Math.max(1, timeoutNs / NANOS_PER_MILLIS))
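
The effect of the clamp can be checked on a plain lock object. A minimal sketch, assuming nothing about the surrounding Spark code: ClampedWait, waitMillis and timedWait are hypothetical names, and the lock is an ordinary Object standing in for executionObserver.responseLock.

```java
final class ClampedWait {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Same expression as the proposed fix: never hand 0 to wait().
    static long waitMillis(long timeoutNs) {
        return Math.max(1L, timeoutNs / NANOS_PER_MILLIS);
    }

    // Waits on the lock with the clamped timeout and returns the elapsed nanoseconds.
    // Nobody ever notifies the lock, so an unclamped wait(0) would block here forever.
    static long timedWait(Object lock, long timeoutNs) throws InterruptedException {
        long startNs = System.nanoTime();
        synchronized (lock) {
            lock.wait(waitMillis(timeoutNs));
        }
        return System.nanoTime() - startNs;
    }

    public static void main(String[] args) throws InterruptedException {
        long elapsedNs = timedWait(new Object(), 1L); // 1 ns timeout, as in the bug scenario
        System.out.println("returned after about " + elapsedNs / NANOS_PER_MILLIS + " ms");
    }
}
```

The trade-off is that a sub-millisecond wait is rounded up to 1 ms, which is harmless here because the loop re-checks its conditions after waking.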
> Thread.wait(0) unintentionally called under rare conditions in
> ExecuteGrpcResponseSender
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-53900
> URL: https://issues.apache.org/jira/browse/SPARK-53900
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.0, 4.2
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
>
>