[ 
https://issues.apache.org/jira/browse/STORM-155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063224#comment-14063224
 ] 

Masatake Iwasaki commented on STORM-155:
----------------------------------------

Recently I ran into this issue. Although the supervisor crashed and the message "Should 
not have multiple topologies assigned to one port" was logged, the assignments 
of topologies logged by nimbus were consistent as a whole. The cause seems to be 
reading assignments while they are being updated, as commented above. 
How about retrying synchronize-supervisor a few times before throwing the 
RuntimeException? Restarting the supervisor is a possible workaround, but it 
causes some loss.
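The retry idea could be sketched roughly as follows (a hedged sketch, not Storm code; `withRetries` and the failure simulation in `main` are hypothetical stand-ins for wrapping the real synchronize-supervisor call):

```java
import java.util.concurrent.Callable;

public class RetrySync {
    // Retry a flaky read of shared state a few times before giving up,
    // sleeping briefly between attempts so the writer can finish its update.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long sleepMs) throws Exception {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(sleepMs);
            }
        }
        throw last; // exhausted retries: surface the original failure
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a sync that reads inconsistent
        // assignments twice, then sees a consistent snapshot.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("inconsistent assignments");
            return "synced";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point is only that a transient inconsistency window would be survived, while a persistent one would still throw after the retry budget is spent.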


> Storm rebalancing code causes multiple topologies assigned to a single port
> ---------------------------------------------------------------------------
>
>                 Key: STORM-155
>                 URL: https://issues.apache.org/jira/browse/STORM-155
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>            Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/551
> We're seeing an issue where rebalancing topologies on the cluster causes 
> workers to be assigned to multiple topologies, which causes supervisors to 
> fail. This can be easily reproduced locally by starting nimbus and a single 
> supervisor with 4 workers, running several topologies, and rebalancing all of 
> them to use 1 worker.
> I tracked the issue to the mk-assignments function in nimbus.clj. In this 
> function, the "existing-assignments" binding is assigned a list of all 
> topologies with assignments except the one being rebalanced. The comment 
> implies this is done to treat all the workers of the topology being 
> rebalanced as unused, and that's what actually happens. However, the absence 
> of the topology being rebalanced from the "existing-assignments" list causes 
> that topology to be ignored completely by the scheduler and other code. As a 
> result, all workers assigned to that topology will be taken over by other 
> topologies, but no changes will be made to the topology being rebalanced, 
> effectively leaving all of its workers assigned to 2 topologies.
> I think that was not the intended behavior. I made a small change to instead 
> treat all the workers of the topology being rebalanced as dead, causing them 
> to be reassigned fairly between all the topologies. This seems to work well 
> and reliably for us.
> Let me know what you think.
> --- a/src/clj/backtype/storm/daemon/nimbus.clj
> +++ b/src/clj/backtype/storm/daemon/nimbus.clj
> @@ -437,8 +437,11 @@
>    (into {} (for [[tid assignment] existing-assignments
>                   :let [topology-details (.getById topologies tid)
>                         all-executors (topology->executors tid)
> +                       ;; for the topology which wants rebalance (specified by the scratch-topology-id)
> +                       ;; we consider all its executors to be dead so they will be treated
> +                       ;; as free slots in the scheduler code.
>                         alive-executors (if (and scratch-topology-id (= scratch-topology-id tid))
> -                                         all-executors
> +                                         (set nil)
>                                           (set (alive-executors nimbus topology-details all-executors assignment)))]]
>               {tid alive-executors})))
> @@ -638,11 +641,7 @@
>          ;; read all the assignments
>          assigned-topology-ids (.assignments storm-cluster-state nil)
>          existing-assignments (into {} (for [tid assigned-topology-ids]
> -                                        ;; for the topology which wants rebalance (specified by the scratch-topology-id)
> -                                        ;; we exclude its assignment, meaning that all the slots occupied by its assignment
> -                                        ;; will be treated as free slot in the scheduler code.
> -                                        (when (or (nil? scratch-topology-id) (not= tid scratch-topology-id))
> -                                          {tid (.assignment-info storm-cluster-state tid nil)})))
> +                                        {tid (.assignment-info storm-cluster-state tid nil)}))
>          ;; make the new assignments for topologies
>          topology->executor->node+port (compute-new-topology->executor->node+port
>                                         nimbus
> ----------
> xumingming: The following code needs to be synchronized (mk-assignments):
> (defn do-rebalance [nimbus storm-id status]
>   (.update-storm! (:storm-cluster-state nimbus)
>                   storm-id
>                   (assoc-non-nil
>                     {:component->executors (:executor-overrides status)}
>                     :num-workers
>                     (:num-workers status)))
>   (mk-assignments nimbus :scratch-topology-id storm-id))
> otherwise it will cause race condition here: 
> https://github.com/nathanmarz/storm/blob/master/storm-core/src/jvm/backtype/storm/scheduler/Cluster.java#L264
> then one port will be assigned to multiple topologies.
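The race xumingming describes is a check-then-act race on the slot table: two concurrent scheduling passes can both observe a port as free and both claim it. A minimal model of that hazard (hypothetical `PortAssignment` class, not Storm's actual Cluster code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PortAssignment {
    // Simplified model of the slot table: port -> topology id.
    private final Map<Integer, String> slots = new ConcurrentHashMap<>();

    // Check-then-act without atomicity: two concurrent scheduling passes can
    // both see the port as free and both claim it -- the race described above.
    boolean claimUnsafe(int port, String topologyId) {
        if (slots.containsKey(port)) return false;
        slots.put(port, topologyId);   // another thread may have won in between
        return true;
    }

    // Making the check-and-claim a single atomic step removes the race,
    // which is the effect of serializing mk-assignments.
    boolean claimSafe(int port, String topologyId) {
        return slots.putIfAbsent(port, topologyId) == null;
    }
}
```

In the real daemon the serialization would come from synchronizing the whole mk-assignments pass rather than a per-port operation, but the failure mode is the same shape.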
> ----------
> stass: That's not the issue here. I initially thought it was a race condition 
> as well, but my problem with rebalancing was in the sequential part of the 
> algorithm (as described in my analysis). In any case, the patch fixed the issue 
> for us, and I have not seen any multiple assignments in the months we have been 
> running with the patch I submitted.
> ----------
> d2r: We are also hitting this issue and would welcome a fix.
> ----------
> nathanmarz: The problem with the proposed patch is that it doesn't treat the 
> slots of the topology that's rebalancing as free. It will certainly consider 
> its executors as dead, but it won't make use of its slots during 
> re-scheduling.
> ----------
> stass: Hi Nathan, thanks for the comment.
> If my reading of the code is right, the workers of the topology being 
> rebalanced will be treated as free since they will be marked as dead (that's 
> the purpose of the top hunk of the diff), and the scheduling code only looks 
> at free workers when assigning workers to topologies. That is also what I 
> observed in practice with this patch running.
> Am I missing something?
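stass's reading of the patched hunk can be modeled in a small sketch (hypothetical names; the real alive-executors check also consults heartbeats, which this ignores):

```java
import java.util.*;

public class AliveExecutorsSketch {
    // For the topology being rebalanced (scratchTopologyId), report no alive
    // executors, so the scheduler treats all of its slots as free; for every
    // other topology, assume its executors are alive.
    static Map<String, Set<Integer>> aliveExecutors(
            Map<String, Set<Integer>> allExecutors, String scratchTopologyId) {
        Map<String, Set<Integer>> alive = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : allExecutors.entrySet()) {
            if (e.getKey().equals(scratchTopologyId)) {
                alive.put(e.getKey(), Collections.emptySet()); // all "dead" -> slots freed
            } else {
                alive.put(e.getKey(), e.getValue()); // assumed alive for the sketch
            }
        }
        return alive;
    }
}
```

Under this model, the rebalanced topology's slots do end up in the free pool, which matches what stass reports observing in practice.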
> ----------
> vinceyang: @nathanmarz, we are also hitting this issue. I read the nimbus 
> code, and it seems that during a rebalance, in the mk-assignments function, the 
> rebalancing topology's assignment info is out of date, but the supervisor does 
> not know this. When another topology's assignment has the same port as the 
> out-of-date assignment, the problem occurs. If we remove the out-of-date 
> assignment in ZK, the problem will not occur. If my idea is OK, I will work on 
> a fix for this issue.
> ----------
> revans2: We have hit this issue too, and so I have been looking into it. It 
> seems that it can happen in two different situations.
> First, a topology is not assigned anything after it previously had slots 
> assigned to it.
> This happens most commonly when rebalancing, because the scheduler is not 
> aware that the rebalanced topology had anything assigned to it previously, but 
> I have been able to reproduce this with other hacked-up schedulers.
> When this happens the supervisor in question will crash continuously until 
> one of the topologies is killed.
> The fix seems to be that we should include assigned-topology-ids in 
> topology->executor->node+port when missing but with the topology pointing to 
> nil.
> Second, the supervisor uses partially written scheduling data from ZK.
> (.set-assignment! storm-cluster-state topology-id assignment) is atomic for a 
> single topology, but not for multiple topologies. This means that the 
> supervisor can read data from ZK that has had some topologies updated, but 
> not all of them.
> When this happens the supervisor will crash and then come back up and recover 
> because the rest of the scheduling data was written to ZK.
> The fix for this seems to be that we need to "lock" zookeeper with a watch 
> file during the update. The supervisors would not read the data until nimbus 
> is done updating. I don't think this is as critical to fix because the 
> supervisor recovers fairly quickly.
> Does my analysis seem correct? I don't understand all of the code perfectly, 
> so I want to be sure I am not missing something.
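The second situation revans2 describes, and the proposed watch-file fix, can be sketched with an in-memory stand-in for the per-topology assignment znodes (hypothetical `AssignmentStore` class; the real fix would use a ZK node plus watches, and would also have to handle a reader checking the flag just before the writer raises it):

```java
import java.util.HashMap;
import java.util.Map;

public class AssignmentStore {
    // Each put is atomic on its own, but a bulk update touches several
    // keys in sequence -- like set-assignment! per topology in ZK.
    private final Map<String, String> assignments = new HashMap<>();
    private volatile boolean updating = false; // the proposed "lock"/watch flag

    // Nimbus side: raise the flag, write every topology, then lower it.
    synchronized void writeAll(Map<String, String> newAssignments) {
        updating = true;
        try {
            assignments.putAll(newAssignments); // several non-atomic writes in the real store
        } finally {
            updating = false;
        }
    }

    // Supervisor side: wait out an in-flight bulk update so it never
    // acts on a half-written set of assignments.
    synchronized Map<String, String> readAll() {
        while (updating) Thread.yield();
        return new HashMap<>(assignments);
    }
}
```

Without the flag, a reader interleaved between two of the writer's puts sees some topologies updated and some stale, which is exactly the state that crashes the supervisor until the remaining writes land.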



--
This message was sent by Atlassian JIRA
(v6.2#6252)
