damccorm opened a new issue, #21598:
URL: https://github.com/apache/beam/issues/21598

   When I run a job with many workers (100 or more) and large shuffle sizes 
(millions of records and/or several GB), my workers fail unexpectedly with:
   ```
   python -m apache_beam.runners.worker.sdk_worker_main
   E0308 12:59:18.067442934     724 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
   Traceback (most recent call last):
     File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
       exec(code, run_globals)
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 264, in <module>
       main(sys.argv)
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
       sdk_harness.run()
     File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 234, in run
       for work_request in self._control_stub.Control(get_responses()):
     File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
       return self._next()
     File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
       raise self
   grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
          status = StatusCode.UNAVAILABLE
          details = "Socket closed"
          debug_error_string = "{"created":"@1646744358.118371750","description":"Error received from peer ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket closed","grpc_status":14}"
   >
   ```
   
   This is probably related to or even the same as BEAM-12448 or BEAM-6258, but 
since one of them is already marked as fixed in a previous version and both 
reports have large tails of unreadable auto-generated comments, I decided to 
create a new issue.
   
   There is not much more information I can give you, since this is all the 
error output I get. It is really hard to debug, and with the large number of 
workers I don't even know whether the worker reporting the error is actually 
the one experiencing it.
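
As a rough aid for correlating these errors across many workers, the `debug_error_string` in the traceback can be parsed as JSON. The following is a minimal sketch, assuming the string always has the shape shown in the log above; `parse_debug_error_string` is a hypothetical helper name, not part of Beam or gRPC.

```python
import json
from datetime import datetime, timezone


def parse_debug_error_string(raw: str) -> dict:
    """Extract time, peer, and status from a gRPC debug_error_string.

    Assumes the JSON layout seen in the traceback above (hypothetical helper).
    """
    info = json.loads(raw)
    # "created" has the form "@<unix seconds>.<fraction>"; drop the leading "@".
    created = float(info["created"].lstrip("@"))
    return {
        "time_utc": datetime.fromtimestamp(created, tz=timezone.utc),
        # The description ends with "peer <address>"; keep only the address.
        "peer": info["description"].rsplit("peer ", 1)[-1],
        "grpc_status": info["grpc_status"],
        "grpc_message": info["grpc_message"],
    }


# The debug_error_string from the traceback above, copied verbatim.
raw = (
    '{"created":"@1646744358.118371750",'
    '"description":"Error received from peer ipv6:[::1]:34305",'
    '"file":"src/core/lib/surface/call.cc","file_line":1074,'
    '"grpc_message":"Socket closed","grpc_status":14}'
)
parsed = parse_debug_error_string(raw)
print(parsed["time_utc"].isoformat(), parsed["peer"], parsed["grpc_status"])
```

For the string above, the parsed time is 12:59:18 UTC, which lines up with the timestamp on the GOAWAY log line, and the peer is `ipv6:[::1]:34305`. Since `[::1]` is the IPv6 loopback address, the closed socket appears to have been a connection to a process on the same host rather than to a remote worker.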
   
   Imported from Jira 
[BEAM-14070](https://issues.apache.org/jira/browse/BEAM-14070). Original Jira 
may contain additional context.
   Reported by: phoerious.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
