tvalentyn commented on issue #30867: URL: https://github.com/apache/beam/issues/30867#issuecomment-2048596523
@DerRidda, would it be possible for you to re-run your job using the Beam 2.55 SDK, then find a stuck worker VM and retrieve stack traces with pystack, as I did in https://github.com/apache/beam/issues/30867#issuecomment-2048463960? Note: you might have to use Dataflow Classic; I am not certain whether SSHing into workers is possible with Dataflow Prime.

To find a stuck VM, look for "Unable to retrieve status info from SDK harness." logs, then find which worker emits those logs by expanding the log entry in Cloud Logging. You might see something like:

```
{
  insertId: "7918190538793288800:164250:0:18926"
  jsonPayload: {
    line: "fnapi_harness_status_service.cc:212"
    message: "Unable to retrieve status info from SDK harness sdk-0-0_sibling_1 within allowed time."
    thread: "115"
  }
  labels: {
    compute.googleapis.com/resource_id: "7918190538793288800"
    compute.googleapis.com/resource_name: "df-<some_identifier>-harness-9zz3"
    compute.googleapis.com/resource_type: "instance"
  }
}
```

Here, `df-<some_identifier>-harness-9zz3` would be the VM name. Then SSH into that VM from the UI or via a gcloud command, log into the running Python container in privileged mode, and run pystack.
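The SSH-and-pystack steps above might look roughly like the following. This is a sketch, not an exact recipe: the zone, container name, and process id are assumptions you will need to substitute from your own job, and the exact container layout on a Dataflow worker can vary.

```
# SSH into the stuck worker VM found in the logs
# (substitute the real VM name and your job's zone).
gcloud compute ssh df-<some_identifier>-harness-9zz3 --zone=<your_zone>

# On the VM: locate the Python SDK harness container.
# The grep pattern is an assumption; inspect `docker ps` output yourself.
docker ps | grep sdk

# Enter the container in privileged mode so pystack can ptrace-attach.
docker exec -it --privileged <container_id> /bin/bash

# Inside the container: install pystack, find the Python worker process,
# and dump its stack traces.
pip install pystack
ps aux | grep python
pystack remote <pid>
```

If `pip install` is not possible inside the container (e.g. no network egress), installing pystack on the VM and pointing it at the containerized process id from the host can also work, since the privileged ptrace attach is what matters.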