scwhittle commented on PR #36528:
URL: https://github.com/apache/beam/pull/36528#issuecomment-3490061481
> > Do we have any periodic messages sent from SDK to runner that would
> > otherwise detect a dead channel?
>
> I tried launching a pipeline, using an SDK with @liferoad's changes
> patched, SSHing to the VM and restarting the 'harness' container to simulate
> the crash. SDK detected `Socket closed` error, and restarted within a few
> seconds. Logs:
>
> ```
> NOTICE 2025-11-04T22:54:50.975484Z valentyn : TTY=pts/1 ; PWD=/home/valentyn ; USER=root ; COMMAND=/var/lib/toolbox/nerdctl -n k8s.io restart 4a25ec1329e0
> ...
> DEFAULT 2025-11-04T22:54:52.114094879Z raise self
> DEFAULT 2025-11-04T22:54:52.114100315Z grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
> DEFAULT 2025-11-04T22:54:52.114105833Z status = StatusCode.UNAVAILABLE
> DEFAULT 2025-11-04T22:54:52.114111214Z details = "Socket closed"
> DEFAULT 2025-11-04T22:54:52.114119261Z debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:12371 {created_time:"2025-11-04T22:54:51.032035467+00:00", grpc_status:14, grpc_message:"Socket closed"}"
> DEFAULT 2025-11-04T22:54:52.114145002Z >
> DEFAULT 2025-11-04T22:54:52.114150544Z {"stream":"stderr"}
> DEFAULT 2025-11-04T22:54:52.114155948Z 2025/11/04 22:54:52 boot.go: error logging message over FnAPI. endpoint localhost:12370 error: EOF message follows
> DEFAULT 2025-11-04T22:54:52.114161451Z 2025/11/04 22:54:52 WARN Python (worker sdk-0-0_sibling_1) exited 2 times: exit status 1
> DEFAULT 2025-11-04T22:54:52.114167074Z restarting SDK process
> ...
> INFO 2025-11-04T22:55:08.835663318Z Python sdk harness starting.
> ...
> INFO 2025-11-04T22:55:10.050536Z All SDK Harnesses registered!
> ```
Thanks, Valentyn. Can we clarify the motivation for this change in the PR
description? If it is just the perceived overhead of heartbeats, I can't
imagine that overhead is significant, and it doesn't seem worth the risk of
the additional latency in some cases. If it is to resolve unnecessary failures
when we're CPU pegged, that seems like a better motivation, and given the
testing it seems safe enough to try.