[
https://issues.apache.org/jira/browse/BEAM-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-14070:
-----------------------------------
Status: Open (was: Triage Needed)
> Beam worker closing gRPC connection with many workers and large shuffle sizes
> -----------------------------------------------------------------------------
>
> Key: BEAM-14070
> URL: https://issues.apache.org/jira/browse/BEAM-14070
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.36.0
> Reporter: Janek Bevendorff
> Priority: P2
>
> When I run a job with many workers (100 or more) and large shuffle sizes
> (millions of records and/or several GB), my workers fail unexpectedly with
> {code:java}
> python -m apache_beam.runners.worker.sdk_worker_main
> E0308 12:59:18.067442934 724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
>     main(sys.argv)
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
>     sdk_harness.run()
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
>     for work_request in self._control_stub.Control(get_responses()):
>   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
>     return self._next()
>   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
>     raise self
> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
>     status = StatusCode.UNAVAILABLE
>     details = "Socket closed"
>     debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
> >
> {code}
> This is probably related to or even the same as BEAM-12448 or BEAM-6258, but
> since one of them is already marked as fixed in a previous version and both
> reports have large tails of unreadable auto-generated comments, I decided to
> create a new issue.
> There is not much more information I can give you, since this is all the
> error output I get. It's really hard to debug, and with the large number of
> workers I don't even know whether the worker reporting the error is actually
> the one experiencing it.
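
For context on the GOAWAY in the log above: a gRPC peer sends ENHANCE_YOUR_CALM with "too_many_pings" when it receives HTTP/2 keepalive pings more often than its own settings allow. The ping rate is governed by channel arguments on the side that sends the pings. The sketch below only collects the relevant arguments in one place. The helper name `keepalive_channel_options` and all numeric values are hypothetical and purely illustrative. The report does not establish whether Beam 2.36.0 exposes a way to set these arguments, nor that keepalive settings are the actual cause of the failure.

```python
from typing import List, Tuple

ChannelOption = Tuple[str, int]


def keepalive_channel_options(
    keepalive_time_s: int = 300,
    keepalive_timeout_s: int = 20,
    permit_without_calls: bool = False,
    max_pings_without_data: int = 2,
) -> List[ChannelOption]:
    """Build gRPC channel arguments that control keepalive pings.

    Hypothetical helper for illustration; the defaults here are
    assumptions, not values taken from Beam or from the issue report.
    """
    if keepalive_time_s <= 0 or keepalive_timeout_s <= 0:
        raise ValueError("keepalive intervals must be positive")
    return [
        # Interval between keepalive pings sent on the channel.
        ("grpc.keepalive_time_ms", keepalive_time_s * 1000),
        # How long to wait for a ping acknowledgement before closing.
        ("grpc.keepalive_timeout_ms", keepalive_timeout_s * 1000),
        # Whether pings may be sent while no RPC is in flight (0 or 1).
        ("grpc.keepalive_permit_without_calls", int(permit_without_calls)),
        # Pings allowed without any data frames being sent in between.
        ("grpc.http2.max_pings_without_data", max_pings_without_data),
    ]


if __name__ == "__main__":
    for name, value in keepalive_channel_options():
        print(f"{name} = {value}")
```

A list like this can be passed as the `options` argument of `grpc.insecure_channel(target, options=...)`. The receiving side has its own limits on how often it accepts pings, so both ends need compatible values for the GOAWAY to stop.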