[jira] [Commented] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred

Yuzhao Chen (JIRA) Mon, 25 Sep 2017 01:10:30 -0700

    [ 
https://issues.apache.org/jira/browse/STORM-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178703#comment-16178703
 ]


Yuzhao Chen commented on STORM-2693:
------------------------------------

[~kabhwan]
I have read STORM-2153 and  didn't find any promotion for heartbeats part. So i 
decide to re-design it which can be done individually apart from the metrics V2 
thing.

> Topology submission or kill takes too much time when topologies grow to a few 
> hundred
> -------------------------------------------------------------------------------------
>
>                 Key: STORM-2693
>                 URL: https://issues.apache.org/jira/browse/STORM-2693
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 0.9.6, 1.0.2, 1.1.0, 1.0.3
>            Reporter: Yuzhao Chen
>              Labels: pull-request-available
>         Attachments: 2FA30CD8-AF15-4352-992D-A67BD724E7FB.png, 
> D4A30D40-25D5-4ACF-9A96-252EBA9E6EF6.png
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> Now for a storm cluster with 40 hosts [with 32 cores/128G memory] and 
> hundreds of topologies, nimbus submission and killing will take about minutes 
> to finish. For example, for a cluster with 300 hundred of topologies，it will 
> take about 8 minutes to submit a topology, this affect our efficiency 
> seriously.
> So, i check out the nimbus code and find two factor that will effect nimbus 
> submission/killing time for a scheduling round:
> * read existing-assignments from zookeeper for every topology [will take 
> about 4 seconds for a 300 topologies cluster]
> * read all the workers heartbeats and update the state to nimbus cache [will 
> take about 30 seconds for a 300 topologies cluster]
> the key here is that Storm now use zookeeper to collect heartbeats [not RPC], 
> and also keep physical plan [assignments] using zookeeper which can be 
> totally local in nimbus.
> So, i think we should make some changes to storm's heartbeats and assignments 
> management.
> For assignment promotion:
> 1. nimbus will put the assignments in local disk
> 2. when restart or HA leader trigger nimbus will recover assignments from zk 
> to local disk
> 3. nimbus will tell supervisor its assignment every time through RPC every 
> scheduling round
> 4. supervisor will sync assignments at fixed time
> For heartbeats promotion:
> 1. workers will report executors ok or wrong to supervisor at fixed time
> 2. supervisor will report workers heartbeats to nimbus at fixed time
> 3. if supervisor die, it will tell nimbus through runtime hook
>     or let nimbus find it through aware supervisor if is survive 
> 4. let supervisor decide if worker is running ok or invalid , supervisor will 
> tell nimbus which executors of every topology are ok



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (STORM-2693) Topology submission or kill takes too much time when topologies grow to a few hundred

Reply via email to