[ https://issues.apache.org/jira/browse/BEAM-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550028#comment-17550028 ]
Danny McCormick commented on BEAM-14070:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/21598

> Beam worker closing gRPC connection with many workers and large shuffle sizes
> -----------------------------------------------------------------------------
>
>                 Key: BEAM-14070
>                 URL: https://issues.apache.org/jira/browse/BEAM-14070
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.36.0
>            Reporter: Janek Bevendorff
>            Priority: P2
>
> When I run a job with many workers (100 or more) and large shuffle sizes
> (millions of records and/or several GB), my workers fail unexpectedly with
> {code:java}
> python -m apache_beam.runners.worker.sdk_worker_main
> E0308 12:59:18.067442934 724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
>     main(sys.argv)
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
>     sdk_harness.run()
>   File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
>     for work_request in self._control_stub.Control(get_responses()):
>   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
>     return self._next()
>   File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
>     raise self
> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
>     status = StatusCode.UNAVAILABLE
>     details = "Socket closed"
>     debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
> >{code}
> This is probably related to or even the same as BEAM-12448 or BEAM-6258, but
> since one of them is already marked as fixed in a previous version and both
> reports have large tails of unreadable auto-generated comments, I decided to
> create a new issue.
> There is not much more information I can give you, since this is all the
> error output I get. It's really hard to debug and with the large number of
> workers I don't even know if the worker reporting the error is actually the
> one experiencing it.
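
Since the report notes that it is unclear which worker actually hit the error first, one way to narrow it down might be to scan the collected worker logs for the GOAWAY line and order the hits by time. The following is only a minimal sketch under stated assumptions: it assumes the logs are already available as lines of text keyed by a worker label, and that each GOAWAY line has the same shape as the one quoted in the report. The names `GoAway`, `GOAWAY_RE`, and `find_goaways` are hypothetical helpers for illustration and are not part of Beam or gRPC.

```python
import re
from typing import Dict, Iterable, List, NamedTuple


class GoAway(NamedTuple):
    """One GOAWAY log line, attributed to the worker whose log contained it."""
    worker: str      # hypothetical label, e.g. a log file or container name
    timestamp: str   # "MMDD HH:MM:SS.fraction" taken from the log prefix
    error_code: str  # e.g. ENHANCE_YOUR_CALM
    debug_data: str  # e.g. too_many_pings


# Assumption: lines look like the one quoted in the report, i.e. a severity
# letter, MMDD, time of day, thread id, source location, then the message.
GOAWAY_RE = re.compile(
    r'[EWIF](?P<mmdd>\d{4})\s+(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s.*?'
    r'Received a\s+GOAWAY with error code\s+(?P<code>\w+)\s+'
    r'and debug data equal to\s+"(?P<data>[^"]*)"'
)


def find_goaways(logs: Dict[str, Iterable[str]]) -> List[GoAway]:
    """Return all GOAWAY events across workers, earliest first.

    Sorting is lexicographic on "MMDD HH:MM:SS.fraction", so it does not
    handle a year rollover and assumes the workers' clocks roughly agree.
    """
    hits = []
    for worker, lines in logs.items():
        for line in lines:
            m = GOAWAY_RE.search(line)
            if m:
                stamp = f"{m['mmdd']} {m['ts']}"
                hits.append(GoAway(worker, stamp, m['code'], m['data']))
    return sorted(hits, key=lambda h: h.timestamp)


if __name__ == "__main__":
    # The first line is copied from the report; the worker labels and the
    # second worker's log content are made up for the demonstration.
    sample = {
        "worker-a": [
            'E0308 12:59:18.067442934 724 chttp2_transport.cc:1103] '
            'Received a GOAWAY with error code ENHANCE_YOUR_CALM and '
            'debug data equal to "too_many_pings"',
        ],
        "worker-b": ["Traceback (most recent call last):"],
    }
    for hit in find_goaways(sample):
        print(hit.worker, hit.timestamp, hit.error_code, hit.debug_data)
```

The earliest hit is only a hint: with clock skew between machines, or when the runner side closes several connections at once, the first worker to log the GOAWAY is not necessarily the one that caused it.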