Sam Whittle created BEAM-12144:
----------------------------------

             Summary: Dataflow streaming worker stuck and unable to get work 
from Streaming Engine
                 Key: BEAM-12144
                 URL: https://issues.apache.org/jira/browse/BEAM-12144
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow
    Affects Versions: 2.26.0
            Reporter: Sam Whittle
            Assignee: Sam Whittle


Observed in 2.26, but it seems likely to affect later versions as well, since 
the earlier issues addressing similar problems predate 2.26.  This seems 
similar to BEAM-9651, but it is not the deadlock observed there.

The thread getting work has the following stack:

--- Threads (1): [Thread[DispatchThread,1,main]] State: WAITING stack: ---
  [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  [email protected]/java.util.concurrent.Phaser$QNode.block(Phaser.java:1127)
  [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
  [email protected]/java.util.concurrent.Phaser.internalAwaitAdvance(Phaser.java:1057)
  [email protected]/java.util.concurrent.Phaser.awaitAdvanceInterruptibly(Phaser.java:747)
  app//org.apache.beam.runners.dataflow.worker.windmill.DirectStreamObserver.onNext(DirectStreamObserver.java:49)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$AbstractWindmillStream.send(GrpcWindmillServer.java:662)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.onNewStream(GrpcWindmillServer.java:868)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$AbstractWindmillStream.startStream(GrpcWindmillServer.java:677)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.<init>(GrpcWindmillServer.java:860)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer$GrpcGetWorkStream.<init>(GrpcWindmillServer.java:843)
  app//org.apache.beam.runners.dataflow.worker.windmill.GrpcWindmillServer.getWorkStream(GrpcWindmillServer.java:543)
  app//org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.streamingDispatchLoop(StreamingDataflowWorker.java:1047)
  app//org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$1.run(StreamingDataflowWorker.java:670)
  [email protected]/java.lang.Thread.run(Thread.java:834)

The status page shows:
GetWorkStream: 0 buffers, 400 inflight messages allowed, 67108864 inflight 
bytes allowed, current stream is 61355396ms old, last send 61355396ms, last 
response -1ms

This shows that the stream was created about 17 hours ago (61355396 ms), with 
the header send recorded at creation time and no response ever received.  The 
stack trace, however, suggests that the header was never actually sent: the 
dispatch thread is still blocked in DirectStreamObserver.onNext.  The stream 
also did not terminate with a deadline-exceeded error.  Not getting an error 
for the stream looks like a gRPC issue, but it would be safer not to block 
indefinitely on the Phaser while waiting to send, and instead to throw an 
exception after, for example, 2x the stream deadline.
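
A minimal sketch of the bounded wait proposed above, not the actual Beam 
change.  It relies on the timed overload of Phaser.awaitAdvanceInterruptibly 
from the JDK, which throws TimeoutException instead of parking forever.  The 
class BoundedPhaserWait, the method awaitAdvanceOrFail, and the timeoutMillis 
parameter are hypothetical names used only for illustration; a caller would 
pass something like 2x the stream deadline.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Beam. */
final class BoundedPhaserWait {
  private BoundedPhaserWait() {}

  /**
   * Waits for the phaser to advance past the given phase, but gives up after
   * timeoutMillis instead of blocking indefinitely.
   *
   * @throws IllegalStateException if the phaser did not advance in time
   */
  static void awaitAdvanceOrFail(Phaser phaser, int phase, long timeoutMillis)
      throws InterruptedException {
    try {
      // Timed overload: returns once the phase has advanced, or throws
      // TimeoutException when the timeout elapses first.
      phaser.awaitAdvanceInterruptibly(phase, timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Surface the stall to the caller so the stream can be torn down and
      // recreated rather than leaving the dispatch thread parked.
      throw new IllegalStateException(
          "Stream not ready to send after " + timeoutMillis + " ms", e);
    }
  }
}
```

With a wait like this, a stream that never becomes ready would produce an 
exception on the dispatch thread, which the caller could treat the same way as 
any other stream failure.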



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
