Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2365
Actually I have dug deeper, and I think we want to do this differently,
although it will be just as simple of a change.
Currently a ConnectException is thrown deep down that end up being wrapped
and bubbles up to be a RuntimeException which you are catching and simply not
adding that client to the list of clients. The issue I am concerned about is
what happens if the node is temporarily down and will come back up. Before
this would cause the worker to restart until it was back up (not ideal but
would work). With this patch we will end up never having the spout connect to
that drpc server.
I think the proper fix is actually here
https://github.com/liu-zhaokun/storm/blob/d92f1a9c8d7442d4959fec57813fc5de42b179a9/storm-client/src/jvm/org/apache/storm/drpc/DRPCInvocationsClient.java#L44
If instead of letting the Exception bubble up and be handled by the Spout
we instead will catch/log it and set the AtomicReference to null. This will
allow the spout to do its reconnect logic periodically to try and recover, but
will still keep the spout up and running if a node is completely removed.
Longer term it might be interesting to try and replace the config with a
location in zookeeper where the DRPC servers can register, so we can add/remove
DRPC servers as needed and the system automatically adjusts.
---