[ 
https://issues.apache.org/jira/browse/STORM-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017384#comment-14017384
 ] 

ASF GitHub Bot commented on STORM-297:
--------------------------------------

Github user clockfly commented on the pull request:

    https://github.com/apache/incubator-storm/pull/103#issuecomment-45052822
  
    Hi @Gvain,
    
    First, thank you for your persistence! It really helps us gain a deeper understanding of Storm.
    I reran the test with your approach, and the results confirm your claim: throughput does increase as the worker count increases.
    
    Test
    -------------
    
    **Throughput vs. worker# ①**
    
    | Worker# | cluster network IN (MB/s) | spout throughput (msg/s) | overall CPU | user CPU | system CPU | total threads# ② |
    | ------------- |:-------------:| -----:|-----:|-----:|-----:|-----:|
    | 4 | 72 | 391860 | 42% | 36% | 6% | 422 |
    | 8 | 88.3 | 475007 | 42% | 35% | 7% | 554 |
    | 12 | 104 | 555174 | 51% | 40% | 11% | 686 |
    | 16 | 116 | 603399 | 59% | 46% | 13% | 818 |
    | 24 | 130 | 622938 | 77% | 55% | 22% | 1082 |
    | 4 (STORM-297) | 140 | 752479 | 74% | 67.8% | 5.2% | 434 |
    
    ①   Test environment: 4 nodes, 48 vcores per machine (CPU: E5-2680), max.spout.pending = 1000, 48 spouts, 48 bolts, and 48 ackers.
    ②   Only 1 GC thread and 1 JIT thread are counted for each JVM.
    
    
    
    
![worker_throughput_without_storm-297](https://cloud.githubusercontent.com/assets/2595532/3169502/82447702-eb9e-11e3-9f97-bc29fde190f3.png)
    
    
    **CPU Usage**
    
    ![cpu_worker_scalability](https://cloud.githubusercontent.com/assets/2595532/3169503/afdac11c-eb9e-11e3-9207-9d524606f613.png)
    
    We can see that both throughput and CPU usage increase as the worker count increases.
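    The diminishing returns are easy to quantify; a minimal sketch, using the throughput numbers copied from the table above:

```python
# Spout throughput (msg/s) per worker count, from the table above.
throughput = {4: 391860, 8: 475007, 12: 555174, 16: 603399, 24: 622938}

base = throughput[4]
for workers in sorted(throughput):
    speedup = throughput[workers] / base
    ideal = workers / 4  # ideal linear scaling relative to 4 workers
    print(f"{workers:2d} workers: {speedup:.2f}x speedup (ideal {ideal:.1f}x)")
```

    Going from 4 to 24 workers (6x the workers) buys only about a 1.6x throughput gain.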
    
    Analysis
    -------------------
    The facts revealed by this test strengthen my conviction that applying this patch is the better route to higher performance:
    
    1. **Higher performance with this patch.**
    
    | | Worker# | cluster network IN (MB/s) | spout throughput (msg/s) |
    | ------------- |:-------------:| -----:|-----:|
    | no STORM-297 | 24 | 130 | 622938 |
    | with STORM-297 | 4 | 140 | 752479 |
    
      **4 workers** with STORM-297 can process **20% more** messages than **24 workers** without this patch, with **less** CPU consumption.
    
    2. **Storm cannot scale well by increasing task parallelism alone**
    
      As the data in your test showed, with 4 workers we can only reach 56% CPU, even though each worker runs 36 parallel tasks, far more than the CPU core count of 24.
    
      **The worker has inherent bottlenecks; the issues are there. Work-arounds won't fix them.**
    
    3. **CPU system time increases as the worker count increases**
    
      | Worker# | user CPU | system CPU |
    | ------------- | -----:|-----:|
    | 4 | 36% | 6% |
    | 8 | 35% | 7% |
    | 12 | 40% | 11% |
    | 16 | 46% | 13% |
    | 24 | 55% | **22%** |
    | 4 (with STORM-297) | 67.8% | 5.2% |
    
      Abnormally high system CPU time is a bad sign.
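      To make the trend concrete, here is a small sketch computing the kernel-time share (system / (user + system)) from the table:

```python
# (user CPU %, system CPU %) per worker count, from the table above.
cpu = {4: (36, 6), 8: (35, 7), 12: (40, 11), 16: (46, 13), 24: (55, 22)}

for workers in sorted(cpu):
    user, system = cpu[workers]
    share = system / (user + system)
    print(f"{workers:2d} workers: {share:.0%} of CPU time spent in the kernel")
```

      The kernel's share of CPU time roughly doubles (about 14% at 4 workers, 29% at 24), so the extra cores are increasingly spent on overhead rather than user work.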
    
    4. **JVM allocation cost is high.**
      In this test, each allocated JVM runs 16+ threads of its own. The supporting data can be found at  
    
    5. **Worker allocation is not cost-free**
    
      Besides the JVM allocation cost, each worker adds ZooKeeper-related threads (4), heartbeat threads (5), the system bolt (2), and Netty boss/worker threads (6: 2 boss, 2 worker, and 2 timer). Together with the JVM threads, each worker adds at least 33 threads.
    
      More threads put more pressure on central services like Nimbus and ZooKeeper.
    
      More threads also mean more context switches, which hurts the performance of every application running on these servers.
    
      | Worker# | cluster total threads# |
    | ------------- |:-------------:|
    | 4 | 422 |
    | 8 | 554 |
    | 12 | 686 |
    | 16 | 818 |
    | 24 | **1082** |
    | 4 (with STORM-297) | 434 |
    (Only 1 GC thread and 1 JIT thread are counted for each JVM.)
    
      In the 24-worker case, the overall thread count is 1082+, with **each machine running more than 270 threads!**
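      The table is consistent with a fixed per-worker thread cost: the total grows linearly with worker count, so its slope recovers the 33-thread figure:

```python
# Cluster-wide thread counts per worker count, from the table above.
threads = {4: 422, 8: 554, 12: 686, 16: 818, 24: 1082}

# Linear growth: the slope is the thread cost of one additional worker.
per_worker = (threads[24] - threads[4]) / (24 - 4)
print(per_worker)  # 33.0 threads per additional worker
```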
    
    
    6. **Serialization and Deserialization Cost**
    
      When a message is delivered to a task in the same process, the tuple is not serialized.
    
      With 4 workers (assuming shuffle grouping and evenly distributed tasks), 1/4 of the messages need no serialization; with 24 workers that ratio drops to 1/24, which means we now need to serialize about 28% more messages.
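      Under those assumptions (shuffle grouping, tasks evenly distributed), the serialized fraction is 1 - 1/workers, and the 28% figure follows directly:

```python
def serialized_fraction(workers: int) -> float:
    # A tuple stays in-process with probability 1/workers (shuffle grouping,
    # evenly distributed tasks) and must be serialized otherwise.
    return 1 - 1 / workers

extra = serialized_fraction(24) / serialized_fraction(4) - 1
print(f"{extra:.0%} more tuples serialized with 24 workers than with 4")
```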
    
     


> Storm Performance cannot be scaled up by adding more CPU cores
> --------------------------------------------------------------
>
>                 Key: STORM-297
>                 URL: https://issues.apache.org/jira/browse/STORM-297
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>            Reporter: Sean Zhong
>              Labels: Performance, netty
>             Fix For: 0.9.2-incubating
>
>         Attachments: Storm_performance_fix.pdf, 
> storm_Netty_receiver_diagram.png, storm_performance_fix.patch, 
> worker_throughput_without_storm-297.png
>
>
> We cannot scale up the performance by adding more CPU cores and increasing 
> parallelism.
> For a 2 layer topology Spout ---shuffle grouping--> bolt, when message size 
> is small (around 100 bytes), we can find in the below picture that neither 
> the CPU nor the network is saturated. When message size is 100 bytes, only 
> 40% of CPU is used, only 18% of network is used, although we have a high 
> parallelism (overall we have 144 executors)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
