[ 
https://issues.apache.org/jira/browse/SPARK-53900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-53900:
---------------------------------------------
    Description: 
 

A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely when 
the configured deadline passes. The bug was introduced in 
[PR|https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]
 during migration from System.currentTimeMillis() to System.nanoTime(), where 
an integer division error converts sub-millisecond timeout values to 0, 
triggering Java's wait(0) behavior (infinite wait).
h2. Root Cause
executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS)  // ← BUG
{*}The Problem{*}: When deadlineTimeNs < System.nanoTime() (deadline has 
passed):
 # Math.max(1, negative_value) clamps to 1 nanosecond

 # Math.min(progressInterval_ns, 1) remains 1 nanosecond

 # Integer division: 1 / 1,000,000 = 0 milliseconds

 # wait(0) in Java means *wait indefinitely until notified*

 # No notification arrives (execution already completed), thread hangs forever

While one of the loop conditions guards against deadlineTimeNs < 
System.nanoTime(), it isn’t sufficient, as the deadline can elapse while inside 
the loop (the time is freshly fetched in the later timeout calculation). The 
probability of occurrence can be exacerbated by GC pauses.
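
The arithmetic in the steps above can be reproduced in isolation. The following is a minimal sketch, not the actual ExecuteGrpcResponseSender code: the class name, the helper names, and the 2-second progress interval are assumptions made for illustration; only NANOS_PER_MILLIS and the max/min/divide sequence come from the description.

```java
public class TimeoutTruncationDemo {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical helper mirroring the calculation described above:
    // clamp the time left until the deadline to at least 1 ns, then cap it
    // at the progress interval.
    static long timeoutNs(long deadlineTimeNs, long nowNs, long progressIntervalNs) {
        long untilDeadlineNs = Math.max(1, deadlineTimeNs - nowNs);
        return Math.min(progressIntervalNs, untilDeadlineNs);
    }

    // The conversion at fault: integer division truncates toward zero.
    static long waitMillis(long timeoutNs) {
        return timeoutNs / NANOS_PER_MILLIS;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long deadline = now - 5_000;               // deadline passed 5 microseconds ago
        long progressIntervalNs = 2_000_000_000L;  // assumed value, for illustration only

        long t = timeoutNs(deadline, now, progressIntervalNs);
        System.out.println("timeoutNs = " + t);                       // prints "timeoutNs = 1"
        System.out.println("wait argument (ms) = " + waitMillis(t));  // prints "wait argument (ms) = 0"
    }
}
```

Any timeoutNs below 1,000,000 truncates to 0 in the same way, so the clamp to 1 nanosecond does not protect the wait call.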
h2. Conditions Required for Bug to Trigger

The bug manifests when *all* of the following conditions are met:
 # *Reattachable execution enabled* (CONNECT_EXECUTE_REATTACHABLE_ENABLED = 
true)

 # *Execution completes prior* to the deadline within the inner loop (all 
responses sent before deadline)

 # *Deadline passes* within the inner loop

h2. Proposed fix

Ensure the millisecond value passed to wait() is always positive, so that a 
sub-millisecond timeoutNs never truncates to 0.
executionObserver.responseLock.wait(Math.max(1, timeoutNs / NANOS_PER_MILLIS))
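
As a rough illustration of why the clamp is enough, the sketch below applies the same Math.max(1, ...) guard to a plain lock object. The class name, the safeWaitMillis helper, and the local responseLock object are hypothetical stand-ins, not Spark code.

```java
public class ClampedWaitDemo {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Same guard as the proposed fix: never hand 0 to wait().
    static long safeWaitMillis(long timeoutNs) {
        return Math.max(1, timeoutNs / NANOS_PER_MILLIS);
    }

    public static void main(String[] args) throws InterruptedException {
        Object responseLock = new Object();  // stand-in for executionObserver.responseLock
        long start = System.nanoTime();
        synchronized (responseLock) {
            // Nobody ever calls notify() here. With wait(0) this would block
            // forever; with the clamp it times out after roughly 1 ms.
            responseLock.wait(safeWaitMillis(1));
        }
        long elapsedMs = (System.nanoTime() - start) / NANOS_PER_MILLIS;
        System.out.println("wait returned after about " + elapsedMs + " ms");
    }
}
```

With the clamp, the worst case is one extra wait of about a millisecond before the loop re-checks its conditions.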

> Thread.wait(0) unintentionally called under rare conditions in 
> ExecuteGrpcResponseSender
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-53900
>                 URL: https://issues.apache.org/jira/browse/SPARK-53900
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.0, 4.2
>            Reporter: Venkata Sai Akhil Gudesa
>            Priority: Major
>


