tvalentyn commented on issue #30867:
URL: https://github.com/apache/beam/issues/30867#issuecomment-2048032874

   I see a pattern: the jobs you start with the Beam 2.55.0 SDK have 
many errors like "Unable to retrieve status info from SDK harness". This is 
definitely concerning. The message means that the Beam SDK process running 
inside a container (the SDK harness) has become unresponsive to SDK status 
requests from the runner; eventually the runner terminates the unresponsive 
SDK, and the worker might restart.
   
   These errors appear fairly early in pipeline execution. 
   
   Dataflow workers serve the SDK status page on localhost:8081/sdk_status, and 
it can be queried manually via:

   gcloud compute ssh --zone "xx-somezone-z" \
     "some-dataflow-gce-worker-01300848-wqox-harness-bvf7" \
     --project "some-project-id" \
     --command "curl localhost:8081/sdk_status"
   
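   The manual query above could also be scripted, for example to poll a worker 
while the errors are occurring. The following is only a minimal sketch: the 
helper names `build_status_command` and `fetch_sdk_status` are hypothetical, 
the worker, zone, and project values are the placeholders from the command 
above, and it assumes `gcloud` is installed and authenticated on the machine 
running it.

```python
from __future__ import annotations

import shlex
import subprocess


def build_status_command(worker: str, zone: str, project: str) -> list[str]:
    """Build the gcloud argv that curls the SDK status page on one worker."""
    return [
        "gcloud", "compute", "ssh", worker,
        "--zone", zone,
        "--project", project,
        "--command", "curl localhost:8081/sdk_status",
    ]


def fetch_sdk_status(worker: str, zone: str, project: str,
                     timeout_s: float = 60.0) -> str:
    """Run the command and return the status page text (hypothetical helper).

    Raises subprocess.CalledProcessError if gcloud or curl fails, and
    subprocess.TimeoutExpired if the worker does not answer in time.
    """
    result = subprocess.run(
        build_status_command(worker, zone, project),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder names copied from the comment; substitute real values.
    cmd = build_status_command(
        "some-dataflow-gce-worker-01300848-wqox-harness-bvf7",
        "xx-somezone-z",
        "some-project-id",
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```
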
   Would it be possible to take a closer look at the differences between your 
2.55.0 and 2.53.0 setups, to narrow down the exact change that increases the 
number of these errors? For example: upgrading or downgrading a dependency X, 
and changing nothing else, increases or decreases the number of these errors. 
I'll also try to reproduce this issue myself. 
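
   One way to start that comparison is to diff the pinned dependencies of the 
two environments. The sketch below assumes you have saved `pip freeze` output 
from each setup; the function names `parse_freeze` and `diff_pins` and every 
package other than apache-beam are hypothetical and only illustrate the idea.

```python
from __future__ import annotations


def parse_freeze(text: str) -> dict[str, str]:
    """Map package name -> pinned version from `pip freeze` style text."""
    pins: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not `name==version`
        # pins (for example editable or URL installs).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(old: dict[str, str],
              new: dict[str, str]) -> dict[str, tuple[str | None, str | None]]:
    """Return {package: (old_version, new_version)} for every difference.

    None means the package is absent from that environment.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


if __name__ == "__main__":
    # Illustrative data only: apart from apache-beam, these packages and
    # versions are made up.
    old_env = "apache-beam==2.53.0\nsome-dependency-x==1.0.0\nunchanged-lib==3.2.1\n"
    new_env = "apache-beam==2.55.0\nsome-dependency-x==2.0.0\nunchanged-lib==3.2.1\n"
    for name, (before, after) in diff_pins(
            parse_freeze(old_env), parse_freeze(new_env)).items():
        print(f"{name}: {before} -> {after}")
```

   Each package in the resulting list is a candidate for changing on its own, 
in either direction, while keeping everything else fixed.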
   
   > It also seems to manifest in the Google Cloud Console Dataflow Job viewer 
UI locking up in the browser until the browser considers tab unresponsive while 
a fixed job stays responsive..
   
   That is likely an unrelated UI issue.

