tvalentyn commented on issue #30867: URL: https://github.com/apache/beam/issues/30867#issuecomment-2048032874
I am observing a pattern: the jobs you start with the Beam 2.55.0 SDK have many errors like "Unable to retrieve status info from SDK harness". This is definitely concerning. The message means that the Beam SDK process running inside a container (the SDK harness) has become unresponsive to SDK status requests from the runner; eventually the runner terminates the unresponsive SDK, and the worker might restart. These errors appear fairly early in pipeline execution.

Dataflow workers serve the SDK status page on localhost:8081/sdk_status, and it can be queried manually via:

    gcloud compute ssh --zone "xx-somezone-z" "some-dataflow-gce-worker-01300848-wqox-harness-bvf7" --project "some-project-id" --command "curl localhost:8081/sdk_status"

Would it be possible to take a closer look at the differences between your 2.55.0 and 2.53.0 setups to narrow down the exact change that increases instances of these errors? For example: upgrading or downgrading a dependency X, and doing nothing else, increases or decreases instances of this error. I'll also try to repro this issue myself.

> It also seems to manifest in the Google Cloud Console Dataflow Job viewer UI locking up in the browser until the browser considers tab unresponsive while a fixed job stays responsive.

That is likely an unrelated UI issue.
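
If it helps with comparing the two setups, the manual status query above could be scripted so it is easy to repeat against several workers. This is only a rough sketch, not part of Beam or Dataflow: the helper names `build_status_command` and `fetch_sdk_status` and the 60-second timeout are assumptions made for illustration, and the worker, zone, and project values are the same placeholders used in the command above.

```python
import shlex
import subprocess

# Endpoint named in the comment above; served on each Dataflow worker.
STATUS_URL = "localhost:8081/sdk_status"


def build_status_command(worker, zone, project):
    """Return the gcloud argv that curls the SDK status page on one worker."""
    return [
        "gcloud", "compute", "ssh", worker,
        "--zone", zone,
        "--project", project,
        "--command", f"curl {STATUS_URL}",
    ]


def fetch_sdk_status(worker, zone, project, timeout_s=60):
    """Run the query and return (ok, output). Hypothetical helper.

    The timeout is an assumed value, chosen only so that a hung
    request does not block the script forever.
    """
    cmd = build_status_command(worker, zone, project)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        # Timed out, or gcloud is not installed on this machine.
        return False, str(exc)
    return result.returncode == 0, result.stdout or result.stderr


if __name__ == "__main__":
    # Placeholder values copied from the comment; replace with real ones.
    cmd = build_status_command(
        "some-dataflow-gce-worker-01300848-wqox-harness-bvf7",
        "xx-somezone-z",
        "some-project-id",
    )
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(cmd))
```

Calling `fetch_sdk_status` for each worker of a 2.55.0 job and a 2.53.0 job would give a side-by-side view of which harnesses respond. A failed or timed-out query is only a hint and would still need to be checked against the worker logs.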