tvalentyn commented on issue #30867:
URL: https://github.com/apache/beam/issues/30867#issuecomment-2048596523

@DerRidda, would it be possible for you to re-run your job using the Beam 2.55 SDK, then find a stuck worker VM and retrieve stack traces with pystack, as I did in https://github.com/apache/beam/issues/30867#issuecomment-2048463960? Note: you might have to use Dataflow Classic; I am not certain whether SSHing into workers is possible with Dataflow Prime.
   
To find a stuck VM, look for "Unable to retrieve status info from SDK harness." logs, then identify which worker emits them by expanding the log entry in Cloud Logging. You might see something like:
   
```
{
  "insertId": "7918190538793288800:164250:0:18926",
  "jsonPayload": {
    "line": "fnapi_harness_status_service.cc:212",
    "message": "Unable to retrieve status info from SDK harness sdk-0-0_sibling_1 within allowed time.",
    "thread": "115"
  },
  "labels": {
    "compute.googleapis.com/resource_id": "7918190538793288800",
    "compute.googleapis.com/resource_name": "df-<some_identifier>-harness-9zz3",
    "compute.googleapis.com/resource_type": "instance"
  }
}
```
   
Here, `df-<some_identifier>-harness-9zz3` would be the VM name.
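
If it is easier, you can also try surfacing these entries from the command line. The sketch below is not verified against this exact job; the filter fields (`resource.type`, `jsonPayload.message`) are assumptions about how these worker logs are indexed, so adjust them to match what you see in Cloud Logging:

```
# Sketch: search recent Dataflow logs for the stuck-harness message and print
# the matching entries; the worker VM name appears in the output under
# labels."compute.googleapis.com/resource_name".
gcloud logging read \
  'resource.type="dataflow_step" AND jsonPayload.message:"Unable to retrieve status info from SDK harness"' \
  --project=<your-project> --freshness=1d --limit=20 --format=json
```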
   
Then SSH to that VM from the UI or via a gcloud command, log into the running Python container in privileged mode, and run pystack.
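
For reference, the steps could look roughly like the following. This is a sketch, not a definitive recipe: the zone, container id, and Python process id are placeholders, and pystack may need to be installed inside the container first.

```
# SSH to the stuck worker VM (zone is a placeholder).
gcloud compute ssh df-<some_identifier>-harness-9zz3 --zone=<worker-zone>

# On the VM: list running containers to find the SDK harness container
# (named sdk-0-0_sibling_1 in the log entry above).
docker ps

# Enter the Python SDK container with elevated privileges so pystack can
# attach to the process.
docker exec -it --privileged <container-id> /bin/bash

# Inside the container: install pystack if needed, find the Python worker
# process, and dump its stack traces.
pip install pystack
ps aux | grep python
pystack remote <pid>
```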

