GitHub user clockfly commented on the pull request:
https://github.com/apache/storm/pull/268#issuecomment-59676137
```
When a target worker is down, sending data to the other target workers
should not be blocked.
The approach we currently use is to drop messages when the connection to
the target worker is not available.
```
This solution may need further discussion:
Approach A (adopted in the current patch):
If we drop the message, it may take up to 30 seconds to be replayed
(depending on the config topology.message.timeout.secs).
On the other hand, it is safer for the current worker (no OOM, especially
for unacked topologies) and for messages dispatched to other workers (no
blocking).
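A minimal sketch of the drop-on-unavailable behaviour in approach A. This is not the code from the patch: `DroppingClient`, `setConnected`, and the in-memory `wire` list are hypothetical stand-ins for the netty client and its channel, used only to show that `send` never blocks and never accumulates messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of approach A; none of these names come from Storm.
class DroppingClient {
    private volatile boolean connected = false;
    private final AtomicLong dropped = new AtomicLong();
    // Stands in for the network channel so the sketch needs no real socket.
    private final List<byte[]> wire = new ArrayList<>();

    void setConnected(boolean up) {
        connected = up;
    }

    /** Returns true if the message was written, false if it was dropped. */
    synchronized boolean send(byte[] message) {
        if (!connected) {
            // No blocking and no buffering: the message is lost here and is
            // only replayed once the topology's message timeout expires.
            dropped.incrementAndGet();
            return false;
        }
        wire.add(message);
        return true;
    }

    long droppedCount() {
        return dropped.get();
    }

    synchronized int sentCount() {
        return wire.size();
    }
}
```

Because nothing is retained while the connection is down, memory use stays flat, which is the "no OOM" property described above. The cost is that a dropped message waits for the replay timeout.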
Approach B:
If we buffer in the netty client, the latency varies between two cases:
Case 1: the target worker is alive, we are reconnecting, and the
reconnection will eventually succeed. The latency includes the time to
reconnect and the flusher interval.
Case 2: the target worker is not alive, but the source worker is not yet
aware of that. In this case, the latency will be the same as in approach A
(30 seconds by default).
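A rough sketch of what buffering inside the client could look like for approach B. All names here (`BufferingClient`, `flush`, the `capacity` bound) are assumptions for illustration and are not taken from the Storm netty client. The bound on the buffer is also an assumption, added because an unbounded buffer would reintroduce the OOM risk noted under approach A.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration of approach B; not the Storm netty client.
class BufferingClient {
    private final int capacity;
    private final Queue<byte[]> pending = new ArrayDeque<>();
    // Stands in for the network channel so the sketch needs no real socket.
    private final List<byte[]> wire = new ArrayList<>();
    private boolean connected = false;

    BufferingClient(int capacity) {
        this.capacity = capacity;
    }

    synchronized void setConnected(boolean up) {
        connected = up;
    }

    /** Queues the message; returns false if the buffer is full. */
    synchronized boolean send(byte[] message) {
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(message);
        return true;
    }

    /** Meant to be called periodically by a flusher; returns messages written. */
    synchronized int flush() {
        if (!connected) {
            return 0; // still reconnecting: keep everything buffered
        }
        int written = 0;
        byte[] message;
        while ((message = pending.poll()) != null) {
            wire.add(message);
            written++;
        }
        return written;
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized int sentCount() {
        return wire.size();
    }
}
```

In case 1, a buffered message leaves on the first `flush` after the reconnection succeeds, so its delay is roughly the reconnect time plus up to one flusher interval. In case 2, `flush` keeps returning 0 and the message is eventually replayed by timeout, as in approach A.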
Approach C:
@HeartSaVioR suggested that it may be more reasonable to buffer the messages
outside of the netty client, preferably in a map keyed by taskId, so that we
can still deliver messages to the target taskId if the mapping from taskId
to worker changes.
This approach requires the user of the messaging layer (the netty client
user) to know the status of the connection (possible with the new interface
ConnectionWithStatus). It also needs a larger change in the Clojure code.
(For efficiency and performance, we want to group messages to the same
target host together.)
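To make approach C more concrete, here is a small sketch of buffering by taskId outside the client. `StatusAwareConnection` is a hypothetical stand-in for the ConnectionWithStatus interface mentioned above, whose real shape is not shown in this comment. `TaskBuffer` and `RecordingConnection` are likewise invented for illustration, and the grouping of messages by target host is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for a connection that exposes its status.
interface StatusAwareConnection {
    boolean isReady();
    void send(int taskId, byte[] message);
}

// Test double that records what it was asked to send.
class RecordingConnection implements StatusAwareConnection {
    final List<byte[]> received = new ArrayList<>();
    private final boolean ready;

    RecordingConnection(boolean ready) {
        this.ready = ready;
    }

    public boolean isReady() {
        return ready;
    }

    public void send(int taskId, byte[] message) {
        received.add(message);
    }
}

// Buffers per taskId, so a message survives a taskId-to-worker remapping.
class TaskBuffer {
    private final Map<Integer, Queue<byte[]>> pendingByTask = new HashMap<>();
    private final Map<Integer, StatusAwareConnection> taskToConnection = new HashMap<>();

    /** Records (or updates) which connection currently serves a task. */
    synchronized void assign(int taskId, StatusAwareConnection connection) {
        taskToConnection.put(taskId, connection);
    }

    synchronized void send(int taskId, byte[] message) {
        pendingByTask.computeIfAbsent(taskId, k -> new ArrayDeque<>()).add(message);
        drain(taskId);
    }

    /** Sends buffered messages for a task if its connection is ready. */
    synchronized void drain(int taskId) {
        StatusAwareConnection connection = taskToConnection.get(taskId);
        Queue<byte[]> queue = pendingByTask.get(taskId);
        if (connection == null || queue == null || !connection.isReady()) {
            return;
        }
        byte[] message;
        while ((message = queue.poll()) != null) {
            connection.send(taskId, message);
        }
    }

    synchronized int pendingCount(int taskId) {
        Queue<byte[]> queue = pendingByTask.get(taskId);
        return queue == null ? 0 : queue.size();
    }
}
```

If a task is reassigned to another worker, calling `assign` with the new connection followed by `drain` delivers the messages that were buffered while the old worker was unreachable. The per-task queues in this sketch are unbounded, so a real version would need a size limit.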