[
https://issues.apache.org/jira/browse/STORM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877585#comment-13877585
]
Kang Xiao commented on STORM-167:
---------------------------------
Hi [~az], I've submitted a pull request against the old Storm git repo:
https://github.com/nathanmarz/storm/pull/741.
The core code change is about 300 lines; many of the other changed lines are
generated by Thrift. We have been running this feature on production clusters
for about 6 months. Any advice is welcome!
*main points:*
*1. interface changes (compatible with old versions)*
1.1 zk: add :topology-version and :update-duration-sec fields to the
StormBase:status map
1.2 zk: add :topology-version to the executor heartbeat
1.3 worker local state: add a versions dir for the topology-version that the
storm worker is currently running
1.4 nimbus: add an updateTopology interface
1.5 storm.thrift: add a topology-version field to three structs:
TopologySummary, ExecutorSummary and TopologyInfo
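To illustrate what "compatible with old versions" could mean for 1.1 and 1.2,
here is a minimal Python sketch (the real code is Clojure, and this is not
taken from the pull request). It assumes that a record written by an older
version simply lacks the new fields and that a reader treats a missing version
as 0; the default value, the dict layout and the helper names are all
hypothetical.

```python
from typing import Any, Dict

# Assumption: records written before this feature have no version field,
# and a missing version is read as 0.
DEFAULT_TOPOLOGY_VERSION = 0


def topology_version(record: Dict[str, Any]) -> int:
    """Read topology-version from a status map or heartbeat, tolerating old records."""
    return record.get("topology-version", DEFAULT_TOPOLOGY_VERSION)


def needs_update(status: Dict[str, Any], local_version: int) -> bool:
    """True when the version stored in zk differs from the local version."""
    return topology_version(status) != local_version


# Illustrative records only: one without the new fields, one with them.
old_status = {"type": "active"}
new_status = {"type": "active", "topology-version": 3, "update-duration-sec": 600}
```

With these records, needs_update(new_status, 2) is true, while an old record
compared against local version 0 triggers no update.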
*2. topology update process*
2.1 the storm client runs "storm jar xxxx -c topology.update=true" to start
the topology update process
2.2 the storm client uploads the new jar file to nimbus
2.3 the storm client calls the nimbus updateTopology interface
2.4 nimbus checks the new topology and replaces the stormdist/storm-id dir
2.5 nimbus updates StormBase in zk, setting the :topology-version (the target
version) and :update-duration-sec (the duration of the update process across
all workers) fields in the StormBase:status map
2.6 supervisors check StormBase in zk and do the update work if the topology's
local version differs from the zk version
2.6.1 sync-supervisor downloads the latest code from nimbus to the local
stormdist/topology-version dir
2.6.2 each supervisor schedules the topology's worker update at a
rand(expect-max-update-time) time point
2.6.3 sync-process checks the local worker version; if it differs from the
version downloaded by sync-supervisor and the update time point has been
reached, it sets the worker state to a new :update state
2.6.4 sync-process kills workers in the :update state as it normally would
2.6.5 sync-process restarts the killed workers as usual, except that they read
the topology and conf from the stormdist/topology-version dir
2.6.6 each new worker heartbeats to zk with the new topology-version, which
can be displayed on the web ui to check update progress
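The supervisor-side steps 2.6.2 to 2.6.6 can be sketched roughly as follows.
This is a simplified Python model, not the Clojure code from the pull request:
the WorkerState class, the "valid" state name, the function names and the use
of a uniform random offset within :update-duration-sec are assumptions made
for illustration.

```python
import random
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WorkerState:
    version: int          # topology-version the worker is running
    state: str = "valid"  # "valid" is a hypothetical name; "update" models :update


def schedule_update_time(now: float, update_duration_sec: float,
                         rng: random.Random) -> float:
    """2.6.2: pick a random time point inside the update window (assumed uniform)."""
    return now + rng.uniform(0, update_duration_sec)


def sync_process(worker: WorkerState, downloaded_version: int,
                 update_at: float, now: float) -> str:
    """2.6.3: mark the worker for update only if its version is stale
    and the scheduled time point has been reached."""
    if worker.version != downloaded_version and now >= update_at:
        worker.state = "update"
    return worker.state


def restart_if_marked(worker: WorkerState, downloaded_version: int) -> WorkerState:
    """2.6.4-2.6.5: a worker in the update state is killed and replaced by a
    new worker that runs the downloaded version."""
    if worker.state == "update":
        return WorkerState(version=downloaded_version)
    return worker


def update_progress(heartbeat_versions: Iterable[int], target_version: int) -> float:
    """2.6.6: fraction of heartbeats that already report the target version."""
    versions = list(heartbeat_versions)
    if not versions:
        return 0.0
    return sum(v == target_version for v in versions) / len(versions)
```

Spreading the restart times randomly over the window means that workers on
different supervisors are unlikely to go down at the same moment, which is
what keeps the service partially available during the update.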
> proposal for storm topology online update
> -----------------------------------------
>
> Key: STORM-167
> URL: https://issues.apache.org/jira/browse/STORM-167
> Project: Apache Storm (Incubating)
> Issue Type: New Feature
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/540
> Currently, topology code can only be updated by killing the topology and
> re-submitting a new one. During the kill and re-submit process some requests
> may be delayed or fail, which is not good for an online service. So we have
> recently been considering adding topology online update.
> Mission
> Update running topology code gracefully, one worker after another, without
> interrupting the whole service. Only the topology code is updated, not the
> topology DAG structure (components, streams and task numbers).
> Proposal
> * the client uses "storm update topology-name new-jar-file" to submit a
> new-jar-file update request
> * nimbus updates the stormdist dir and links topology-dir to the new one
> * nimbus updates the topology version on zk
> * the supervisors that are running this topology update it
> ** check the topology version on zk; if it is not the same as the local
> version, a topology update begins
> ** each supervisor schedules the topology's worker update at a
> rand(expect-max-update-time) time point
> ** sync-supervisor downloads the latest code from nimbus
> ** sync-process checks the local worker heartbeat version (to be added); if
> it differs from the version downloaded by sync-supervisor, it kills the worker
> ** sync-process restarts the killed worker
> ** the new worker heartbeats to zk with its version (to be added), which can
> be displayed on the web ui to check update progress.
> This feature is deployed in our production clusters. It's really useful for
> topologies that handle online requests waiting for a response. The topology
> jar can be updated without taking the entire service offline.
> We hope that this feature is useful for others too.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)