[
https://issues.apache.org/jira/browse/BEAM-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904727#comment-16904727
]
Oded Valtzer commented on BEAM-6451:
------------------------------------
did someone have a look on this issue? we are having those "processing stuck"
on operations and we can't understand why this happens. it's on 2.14 python sdk
on dataflow. mainly happening when we have intense-long running steps...
> Portability Pipeline eventually hangs on bundle registration
> ------------------------------------------------------------
>
> Key: BEAM-6451
> URL: https://issues.apache.org/jira/browse/BEAM-6451
> Project: Beam
> Issue Type: Bug
> Components: java-fn-execution, runner-dataflow, sdk-py-harness
> Reporter: Scott Wegner
> Assignee: Mikhail Gryzykhin
> Priority: Minor
> Labels: portability, triaged
>
> We've seen jobs using portability start off in a healthy state, but then
> eventually get stuck and hang on bundle registration. We see error logs from
> the worker harness:
> {code}
> Processing stuck in step s01 for at least 06h30m00s without outputting or
> completing in state finish at
> sun.misc.Unsafe.park(Native Method) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at
> org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) at
> org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:277)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
> at
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119)
> at
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1226)
> at
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:141)
> at
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:965)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at
> java.lang.Thread.run(Thread.java:745)
> {code}
> Looking at [the
> code|https://github.com/apache/beam/blob/release-2.8.0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L277],
> it looks like there are no timeouts on the Bundle Registration calls over
> the FnApi, which contributes to this hanging forever rather than giving a
> better failure.
> This bug report came from a customer running a python streaming pipeline
> using the new portability framework on Dataflow. Hopefully we can repro on
> our own in order to link to the job / logs.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)