[
https://issues.apache.org/jira/browse/STORM-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987968#comment-14987968
]
ASF GitHub Bot commented on STORM-162:
--------------------------------------
Github user zhuoliu commented on a diff in the pull request:
https://github.com/apache/storm/pull/847#discussion_r43797156
--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -310,6 +313,26 @@
     [node (Integer/valueOf port-str)]
     ))
+(defn mk-refresh-load [worker]
+  (let [local-tasks (set (:task-ids worker))
+        remote-tasks (set/difference (worker-outbound-tasks worker) local-tasks)
+        short-executor-receive-queue-map (:short-executor-receive-queue-map worker)
+        next-update (atom 0)]
+    (fn this
+      ([]
+        (let [^LoadMapping load-mapping (:load-mapping worker)
+              local-pop (map-val (fn [queue]
+                                   (let [q-metrics (.getMetrics queue)]
+                                     (/ (double (.population q-metrics))
+                                        (.capacity q-metrics))))
+                                 short-executor-receive-queue-map)
+              remote-load (reduce merge
+                                  (for [[np conn] @(:cached-node+port->socket worker)]
+                                    (into {} (.getLoad conn remote-tasks))))
+              now (System/currentTimeMillis)]
+          (.setLocal load-mapping local-pop)
+          (.setRemote load-mapping remote-load)
+          (when (> now @next-update)
+            (.sendLoadMetrics (:receiver worker) local-pop)
+            (reset! next-update (+ 5000 now))))))))
--- End diff ---
We could bind the "5000" to a named value in the let[].
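As an illustration of that suggestion, a minimal sketch in which `do-refresh!` stands in for the body of the real function (the helper name is illustrative, not from the PR):

(defn make-throttled-refresh [do-refresh!]
  (let [refresh-interval-ms 5000   ; refresh period in ms; was the bare literal 5000
        next-update (atom 0)]
    (fn []
      (let [now (System/currentTimeMillis)]
        (when (> now @next-update)
          (do-refresh!)
          (reset! next-update (+ refresh-interval-ms now)))))))

Calls to the returned function within the interval become no-ops, and the magic number is documented and changed in one place.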
> Load Balancing Shuffle Grouping
> -------------------------------
>
> Key: STORM-162
> URL: https://issues.apache.org/jira/browse/STORM-162
> Project: Apache Storm
> Issue Type: Wish
> Components: storm-core
> Reporter: James Xu
> Assignee: Robert Joseph Evans
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/571
> Hey @nathanmarz,
> I think that the current shuffle grouping is creating very obvious hot-spots
> in load on hosts here at Twitter. The reason is that randomized message
> distribution to the workers is susceptible to the balls and bins problem:
> http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
> the odds that some particular queue gets bogged down when you're assigning
> tasks randomly are high. You can solve this problem with a load-aware shuffle
> grouping -- when shuffling, prefer tasks with lower load.
> What would it take to implement this feature?
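> As a minimal Clojure sketch of that idea (illustrative names, not Storm's
> actual grouping API), assuming `task-ids` is a vector of downstream task ids
> and `task-load` maps each task id to a load estimate in [0, 1]:
>
> (defn choose-task [task-ids task-load]
>   ;; pick two candidates at random and keep the less-loaded one
>   ;; ("power of two choices"); this alone avoids the worst
>   ;; balls-and-bins hot spots
>   (let [a (rand-nth task-ids)
>         b (rand-nth task-ids)]
>     (if (<= (get task-load a 0.0) (get task-load b 0.0)) a b)))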
> ----------
> sritchie: Looks like Rap Genius was heavily affected when Heroku started
> running a "shuffle grouping" on tasks to its dynos:
> http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
> 50x performance degradation over a more intelligent load-balancing scheme
> that only sent tasks to non-busy dynos. Seems very relevant to Storm.
> ----------
> nathanmarz: It's doing randomized round robin, not fully random distribution.
> So every downstream task gets the same number of messages. But yes, I agree
> that this would be a great feature. Basically what this requires is making
> stats of downstream tasks available to the stream grouping code. The best way
> to implement this would be:
> 1. Implement a broadcast message type in the networking code, so that one can
>    efficiently send a large object to all tasks in a worker (rather than having
>    to send N copies of that large message).
> 2. Have a single executor in every topology that polls nimbus for accumulated
>    stats once per minute and then broadcasts that information to all tasks in
>    all workers (see the sketch below).
> 3. Wire up the task code to pass that information along from the task to the
>    outgoing stream groupings for that task (and add appropriate methods to
>    the CustomStreamGrouping interface to receive the stats info).
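> As a sketch of step 2, assuming hypothetical helpers `poll-nimbus-stats`
> and `broadcast-to-workers!` supplied by the caller (neither is a real Storm
> function):
>
> (defn stats-broadcast-loop [poll-nimbus-stats broadcast-to-workers!]
>   (future
>     (loop []
>       ;; poll accumulated stats and push them to every worker
>       (broadcast-to-workers! (poll-nimbus-stats))
>       (Thread/sleep 60000)   ; once per minute
>       (recur))))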
> ----------
> sorenmacbeth: @nathanmarz @sritchie Did any progress ever get made on this?
> Is the description above still relevant to Storm 0.9.0? We are getting bitten
> by this problem and would love to see something like this implemented.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)