Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/414
@danny0405 as @HeartSaVioR said, it depends on the versions you are
upgrading between. Most of the time we have maintained wire and binary
compatibility, so you can do the upgrade piecemeal. This should work between
versions of storm that share the same major version number: 1.0.0 to 1.1.0, or
1.1.0 to 1.1.2, but not 0.10.x to 1.0.0.
The procedure that we follow when doing an upgrade is:
1) Shut down and upgrade nimbus (we are not currently running HA, but if we
were, step 1.b would be to upgrade the other nimbus instances one at a time).
2.a) Pick a single node that has not been upgraded yet.
2.b) Install the new version of storm on the node.
2.c) Shoot all of the storm processes: supervisor, logviewer, and workers.
2.d) Clear out all of the state on the node (NOT needed every time, but we
are cautious because of bugs in the past).
2.e) Relaunch the supervisor and logviewer.
3) Repeat until all of the nodes are done.
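As a rough illustration, the per-node steps above might look like the shell sketch below. The node list, install path, tarball name, and daemon process patterns are all assumptions about a plain tarball install, not part of the original procedure; adapt them to your packaging, init system, and storm.local.dir. By default it only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch of the sequential per-node upgrade. All hostnames and paths
# below are placeholders; set DRY_RUN=0 only once they match your cluster.
NODES="${NODES:-node1 node2 node3}"
STORM_HOME="${STORM_HOME:-/opt/storm}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

# run NODE CMD -- execute CMD on NODE over ssh, or just print it in dry-run mode
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "ssh $1 -- $2"
  else
    ssh "$1" "$2"
  fi
}

for node in $NODES; do
  # 2.b) install the new version of storm (version/path are placeholders)
  run "$node" "tar -xzf /tmp/apache-storm-1.1.2.tar.gz -C /opt"
  # 2.c) shoot supervisor, logviewer, and workers; the [d] trick keeps
  # pkill -f from matching (and killing) its own command line
  run "$node" "pkill -f '[d]aemon.supervisor'; pkill -f '[d]aemon.logviewer'; pkill -f '[d]aemon.worker'"
  # 2.d) clear out node-local state (not strictly required, but cautious)
  run "$node" "rm -rf $STORM_HOME/storm-local/*"
  # 2.e) relaunch the supervisor and logviewer
  run "$node" "$STORM_HOME/bin/storm supervisor & $STORM_HOME/bin/storm logviewer &"
done
```

In a real deployment the relaunch step would normally go through whatever supervision you already use (systemd, daemontools, etc.) rather than backgrounding the processes by hand.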
For our large clusters we actually do a few nodes at a time, not one. This
procedure does have a few issues; the biggest is churn in the worker
processes. We try to avoid doing upgrades often because the process is not
truly transparent to topologies. They recover, but every one of their worker
processes has been shot at least once, and possibly multiple times. This can
cause data issues in non-trident topologies, and can slow down processing in
trident ones.
I would recommend that you do it a little differently, and this is what we
want to move to ourselves.
For each node, in parallel as much as possible, install the new version of
storm, then shoot the supervisor and the logviewer. Wait for them all to come
back up, or at least for enough of them that you feel good about it.
Then, again as parallel as possible, shoot all of the worker processes on all
of the nodes.
This still has the disadvantage that all of the worker processes are shot,
which slows things down, but they are guaranteed to be shot only once, and the
recovery time should be much faster: the supervisor relaunches them quickly
instead of nimbus possibly timing them out and rescheduling them on a node
that has not been upgraded yet.
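A hedged sketch of this two-phase approach, with the same caveats as the sequential version (hostnames, paths, and process patterns are placeholders I am assuming, and DRY_RUN=1 just prints the commands):

```shell
#!/bin/sh
# Two-phase rolling restart: upgrade all supervisors/logviewers first,
# then shoot every worker exactly once. Placeholders throughout.
NODES="${NODES:-node1 node2 node3}"
STORM_HOME="${STORM_HOME:-/opt/storm}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "ssh $1 -- $2"; else ssh "$1" "$2"; fi
}

# Phase 1: on every node in parallel, install the new version and bounce
# only the supervisor and logviewer. Workers keep running on the old bits.
for node in $NODES; do
  run "$node" "tar -xzf /tmp/apache-storm-1.1.2.tar.gz -C /opt && \
    pkill -f '[d]aemon.supervisor'; pkill -f '[d]aemon.logviewer'; \
    $STORM_HOME/bin/storm supervisor & $STORM_HOME/bin/storm logviewer &" &
done
wait
# (Check the Storm UI here and proceed only once enough supervisors are back.)

# Phase 2: shoot all of the workers on all of the nodes, again in parallel.
# The already-upgraded supervisors relaunch them immediately, so each worker
# is shot exactly once instead of being rescheduled by nimbus.
for node in $NODES; do
  run "$node" "pkill -f '[d]aemon.worker'" &
done
wait
```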