damccorm opened a new issue, #21598:
URL: https://github.com/apache/beam/issues/21598
When I run a job with many workers (100 or more) and large shuffle sizes
(millions of records and/or several GB), my workers fail unexpectedly with:
```
python -m apache_beam.runners.worker.sdk_worker_main
E0308 12:59:18.067442934 724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
    main(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
>
```
This is probably related to or even the same as BEAM-12448 or BEAM-6258, but
since one of them is already marked as fixed in a previous version and both
reports have large tails of unreadable auto-generated comments, I decided to
create a new issue.
There is not much more information I can give you, since this is all the
error output I get. It's really hard to debug, and with so many workers I
don't even know whether the worker reporting the error is actually the one
experiencing it.
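One way to narrow down which workers actually receive the GOAWAY is to tally the gRPC log line across the collected worker logs. The following is only a minimal sketch, not part of Beam: the function name `tally_goaways`, the `logs_by_worker` mapping, and the worker labels are hypothetical, and it assumes each worker's log text has already been gathered into a string. The regular expression is modeled on the exact log line quoted above.

```python
import re
from collections import Counter

# Modeled on the gRPC log line quoted in this report, e.g.
#   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
GOAWAY_RE = re.compile(
    r'Received a GOAWAY with error code (?P<code>\w+)'
    r' and debug data equal to "(?P<debug>[^"]*)"'
)


def tally_goaways(logs_by_worker):
    """Count (error code, debug data) pairs of GOAWAY log lines per worker.

    logs_by_worker: dict mapping a worker label (hypothetical) to its log text.
    Workers without any GOAWAY line are omitted from the result.
    """
    counts = {}
    for worker, text in logs_by_worker.items():
        # Log collectors may wrap long lines, so collapse all whitespace first.
        flat = " ".join(text.split())
        hits = Counter(
            (m["code"], m["debug"]) for m in GOAWAY_RE.finditer(flat)
        )
        if hits:
            counts[worker] = hits
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; real logs would come from the runner's log storage.
    sample = {
        "worker-1": 'E0308 ... Received a GOAWAY with error code '
                    'ENHANCE_YOUR_CALM and debug data equal\nto "too_many_pings"',
        "worker-2": "nothing unusual here",
    }
    for worker, hits in tally_goaways(sample).items():
        print(worker, dict(hits))
```

Workers that show the GOAWAY line are the ones whose connection was actually closed by the peer; workers that only show the `Socket closed` traceback may be reporting a downstream effect.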
Imported from Jira
[BEAM-14070](https://issues.apache.org/jira/browse/BEAM-14070). Original Jira
may contain additional context.
Reported by: phoerious.