It turns out that rexi_server processes can die in such a way that they are not restarted. This can leave (and has left!) a cluster unable to issue RPC calls, effectively rendering the cluster useless.
A slightly redacted log showing this happen after hitting the process limit:

2018-08-18T21:00:05.106860Z db3.clustername <0.19934.2> - gen_server '[email protected]' terminated with reason: system_limit at erlang:spawn_opt/1 <= erlang:spawn_monitor/3 <= rexi_server:handle_cast/2(line:71) <= gen_server:try_dispatch/4(line:593) <= gen_server:handle_msg/5(line:659) <= proc_lib:init_p_do_apply/3(line:237)
state: {st,6946959,7078032,{[],[]},0,0}

[ Full content available at: https://github.com/apache/couchdb/issues/1571 ]
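For anyone trying to confirm whether a node is in this state, a rough diagnostic sketch follows. It is meant to be pasted into a remote shell attached to the affected node and uses only standard BIFs. The "rexi_server" name prefix is an assumption taken from the process name in the log above, not something verified against the rexi source.

```erlang
%% Sketch only: how close is the VM to its process limit?
Count = erlang:system_info(process_count),
Limit = erlang:system_info(process_limit),
io:format("processes: ~p of ~p (~.1f%)~n", [Count, Limit, 100 * Count / Limit]).

%% Sketch only: list registered names that start with "rexi_server"
%% (assumed prefix) together with their pids. A per-node server that
%% died and was not restarted would be absent from this list.
[{Name, whereis(Name)} || Name <- registered(),
                          lists:prefix("rexi_server", atom_to_list(Name))].
```

If the process count sits near the limit, a spawn inside rexi_server:handle_cast/2 can fail with system_limit as in the log, so the count is worth checking before restarting anything.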
