[ 
https://issues.apache.org/jira/browse/BEAM-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janek Bevendorff updated BEAM-14070:
------------------------------------
    Description: 
When I run a job with many workers (100 or more) and large shuffle sizes 
(millions of records and/or several GB), my workers fail unexpectedly with

 
{code}
python -m apache_beam.runners.worker.sdk_worker_main 
E0308 12:59:18.067442934     724 chttp2_transport.cc:1103]   Received a GOAWAY 
with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings" 
Traceback (most recent call last): 
 File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main 
   return _run_code(code, main_globals, None, 
 File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code 
   exec(code, run_globals) 
 File 
"/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
 line 264, in <module> 
   main(sys.argv) 
 File 
"/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
 line 155, in main 
   sdk_harness.run() 
 File 
"/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py",
 line 234, in run 
   for work_request in self._control_stub.Control(get_responses()): 
 File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 426, in 
__next__ 
   return self._next() 
 File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 826, in 
_next 
   raise self 
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that 
terminated with: 
       status = StatusCode.UNAVAILABLE 
       details = "Socket closed" 
       debug_error_string = 
"{"created":"@1646744358.118371750","description":"Error received from peer 
ipv6:[::1]:34305","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Socket
 
closed","grpc_status":14}" 
>{code}
This is probably related to, or even the same as, BEAM-12448 or BEAM-6258, but 
since one of those is already marked as fixed in an earlier version and both 
reports have accumulated a long tail of unreadable auto-generated comments, I 
decided to file a new issue.

There is not much more information I can give you, since this is all the error 
output I get. It is very hard to debug, and with this many workers I cannot 
even tell whether the worker reporting the error is actually the one 
experiencing it.
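For context: a GOAWAY with {{ENHANCE_YOUR_CALM}} and debug data {{too_many_pings}} typically means the client sent HTTP/2 keepalive pings more often than the server permits, so the server tore the connection down (which then surfaces as {{StatusCode.UNAVAILABLE}} / "Socket closed" on the in-flight Control stream). A minimal sketch of the client-side channel arguments involved, assuming a plain {{grpcio}} channel rather than Beam's internal channel setup (the values below are illustrative, not Beam defaults):

```python
# Hypothetical sketch: client-side HTTP/2 keepalive tuning that avoids a
# "too_many_pings" GOAWAY. These are standard gRPC core channel args; the
# specific values here are illustrative assumptions, not Beam's settings.
import grpc

KEEPALIVE_OPTIONS = [
    # Send a keepalive ping at most once per minute.
    ("grpc.keepalive_time_ms", 60_000),
    # Wait up to 20 s for the ping ack before closing the transport.
    ("grpc.keepalive_timeout_ms", 20_000),
    # Permit keepalive pings even when no RPC is in flight.
    ("grpc.keepalive_permit_without_calls", 1),
    # Do not limit pings sent while data frames are outstanding.
    ("grpc.http2.max_pings_without_data", 0),
]


def make_channel(target: str) -> grpc.Channel:
    """Open an insecure channel with the keepalive options above."""
    return grpc.insecure_channel(target, options=KEEPALIVE_OPTIONS)
```

The server side has matching knobs ({{grpc.http2.min_ping_interval_without_data_ms}}, {{grpc.keepalive_permit_without_calls}}); a mismatch between the two ends is the usual trigger for this GOAWAY, which may be why it only shows up under heavy shuffle load when the Control stream sits idle for long stretches.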


> Beam worker closing gRPC connection with many workers and large shuffle sizes
> -----------------------------------------------------------------------------
>
>                 Key: BEAM-14070
>                 URL: https://issues.apache.org/jira/browse/BEAM-14070
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.36.0
>            Reporter: Janek Bevendorff
>            Priority: P2
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
