[
https://issues.apache.org/jira/browse/STORM-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018825#comment-14018825
]
ASF GitHub Bot commented on STORM-297:
--------------------------------------
Github user clockfly commented on the pull request:
https://github.com/apache/incubator-storm/pull/103#issuecomment-45226047
@miguno, I have several more observations:
1. Network still not efficient enough
We can see from the test report that, even after this fix, throughput is still
bottlenecked by the network (CPU: 72%, network: 45%), since the CPU still has
28% headroom. That is strange, because only 45% of the theoretical network
bandwidth is being used.
2. Uneven message receive latency across machines
In the experiment, I noticed that there are always some machines whose
message receive latency is much higher than the others'. For example, tuples
generated on machine A are sent to tasks on machines B, C, and D; in one run
the tasks on B take more time to receive messages, while in another run D may
be the slowest.
My guess is that some machines end up with a longer Netty receiver queue than
the other machines, and after some time the queue length on each machine
becomes stable but unequal (new input = new output). The latency differs
because the queue lengths differ. Changing max.spout.pending will not improve
this, because it only controls the overall number of messages sent from A; it
does not treat B, C, and D differently.
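To make the queue-length argument concrete: in steady state, Little's law ties
the waiting time in a receiver queue to its standing length (a general queuing
identity, not a number measured in this test):

    L_i = \lambda \cdot W_i  \Rightarrow  W_i = L_i / \lambda

With the same per-destination send rate \lambda from A, the machine whose
Netty receiver queue stabilizes at a larger L_i sees a proportionally larger
receive latency W_i, which is consistent with the guess above.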
3. A better max.spout.pending?
I observed that once max.spout.pending is tuned to a large enough value,
increasing it further only adds latency, not throughput: when
max.spout.pending doubles, the latency doubles.
Could we do flow control adaptively, so that we stop increasing
max.spout.pending once there is no further benefit?
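A rough sketch of the kind of adaptive control I mean (the class and method
names are made up for illustration; this is not Storm code): keep growing the
effective pending cap only while throughput still improves, and back off once
extra pending tuples only add latency.

    /**
     * Hypothetical controller for an adaptive max.spout.pending cap.
     * Grow the cap while measured throughput keeps improving; shrink it
     * once a larger cap no longer helps (extra pending = extra latency).
     * Thresholds and step sizes are illustrative only.
     */
    public class PendingController {
        private int cap = 1000;            // current effective pending cap
        private double lastThroughput = 0; // tuples/sec in previous window

        /** Called once per measurement window with the observed throughput. */
        public int onWindow(double throughput) {
            if (throughput > lastThroughput * 1.05) {
                cap = (int) (cap * 1.5);      // still gaining: probe upward
            } else {
                cap = Math.max(100, cap / 2); // no gain: back off
            }
            lastThroughput = throughput;
            return cap;
        }
    }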
4. Potential deadlock when all intermediate buffers are full
Consider two workers: task1 (worker A) delivers messages to task3 (worker B),
and task3 delivers to task2 (worker A). There is a loop! It is possible for
all of the worker sender/receiver buffers to fill up and block.

The current workaround in Storm is tricky: it uses an unbounded receiver
buffer (LinkedBlockingQueue) for each worker to break the loop. But this is
not good, because the receiver buffer can potentially grow very long, and the
latency can become very high.
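To spell out the trade-off in code (a simplified illustration, not the actual
worker internals): a bounded buffer blocks the producer when it is full, which
is exactly what lets the task1 -> task3 -> task2 cycle wedge, while the
unbounded workaround never blocks but puts no cap on queue length or latency.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ReceiverBuffers {
        // Bounded: gives back-pressure, but put() blocks when full. With the
        // cross-worker cycle above, every sender can end up blocked on a full
        // buffer at the same time (the deadlock described in this point).
        static final BlockingQueue<byte[]> bounded =
                new ArrayBlockingQueue<>(1024);

        // Unbounded (the current workaround): put() never blocks, so the cycle
        // cannot wedge, but queue length and therefore latency are uncapped
        // under sustained overload.
        static final BlockingQueue<byte[]> unbounded =
                new LinkedBlockingQueue<>();
    }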
5. Is it necessary for each task to have a dedicated send queue thread?
Currently, each task has a dedicated send queue thread that pushes data to
the worker transfer queue. During profiling, the task send queue thread is
usually in the WAITING state. Maybe it would be a good idea to replace the
dedicated threads with a shared thread pool?
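Roughly what I have in mind, sketched with simplified types standing in for
the per-task send queues and the worker transfer queue: a small shared pool
drains every task's send queue instead of one mostly idle thread per task.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SharedSender {
        /**
         * Sketch: poolSize shared threads poll all per-task send queues
         * round-robin and forward tuples to the worker transfer queue,
         * replacing one dedicated (and mostly WAITING) thread per task.
         */
        static void start(List<BlockingQueue<Object>> taskSendQueues,
                          BlockingQueue<Object> workerTransferQueue,
                          int poolSize) {
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        for (BlockingQueue<Object> q : taskSendQueues) {
                            Object tuple = q.poll(1, TimeUnit.MILLISECONDS);
                            if (tuple != null) {
                                workerTransferQueue.put(tuple);
                            }
                        }
                    }
                    return null;
                });
            }
        }
    }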
6. Acker workload is very high
In the test, I noticed that the acker task is very busy. Since each message
is small (100 bytes), there is a huge number of tuples that need to be acked.
Can this acker cost be reduced?
For example, we could group tuples at the spout into time slices, with every
tuple in a slice sharing the same root tuple id. If the time slice is 100 ms
and there are 10,000 messages in the slice, they all share the same root id;
before sending to the acker task, we could first XOR all ack values for the
same root id locally on each worker. In that case, we might reduce both the
acking network traffic and the acker task cost. The drawback is that when a
message is lost, we need to replay all messages in that slice.
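A sketch of the local pre-aggregation idea (a hypothetical helper, not
existing Storm code): each worker XORs the ack values of every tuple that
shares a root id and sends only one combined ack per root id when the slice
is flushed.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Hypothetical per-worker ack pre-aggregation for the time-slice idea:
     * all tuples emitted in the same slice share one root id, the worker
     * XORs their ack values locally, and a single combined ack per root id
     * goes to the acker instead of one ack per tuple.
     */
    public class LocalAckAggregator {
        private final Map<Long, Long> xorByRootId = new ConcurrentHashMap<>();

        /** Record an ack locally; nothing is sent over the network here. */
        public void ack(long rootId, long ackValue) {
            xorByRootId.merge(rootId, ackValue, (a, b) -> a ^ b);
        }

        /** Drain one combined ack per root id for the acker (send not shown). */
        public Map<Long, Long> flush() {
            // Sketch only: a real version must handle acks racing with flush().
            Map<Long, Long> out = new HashMap<>(xorByRootId);
            xorByRootId.clear();
            return out;
        }
    }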
7. Worker receive thread blocked by a task receiver queue
The worker receive thread tries to publish messages to the receive queue of
each task sequentially, in a blocking way. If one task's receiver queue is
full, the thread blocks and waits, stalling delivery to the other tasks as
well.
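For comparison, an illustrative non-blocking variant (a sketch, not a proposed
patch): the receive thread offers to each task's queue and parks tuples in a
small per-task backlog when that queue is full, so one slow task does not
stall delivery to all the others. The parked backlog then raises the same
bounded-versus-unbounded question as point 4.

    import java.util.Deque;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    /**
     * Illustrative alternative to the blocking sequential publish: offer() to
     * the target task's receive queue and keep a per-task backlog for the
     * queues that are currently full.
     */
    public class NonBlockingDispatch {
        static void dispatch(List<BlockingQueue<Object>> taskQueues,
                             List<Deque<Object>> backlogs,  // one per task
                             int taskIndex, Object tuple) {
            BlockingQueue<Object> q = taskQueues.get(taskIndex);
            Deque<Object> backlog = backlogs.get(taskIndex);
            // Drain any earlier backlog first so per-task ordering is kept.
            while (!backlog.isEmpty() && q.offer(backlog.peekFirst())) {
                backlog.pollFirst();
            }
            if (!backlog.isEmpty() || !q.offer(tuple)) {
                backlog.addLast(tuple); // queue full: park instead of blocking
            }
        }
    }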
> Storm Performance cannot be scaled up by adding more CPU cores
> --------------------------------------------------------------
>
> Key: STORM-297
> URL: https://issues.apache.org/jira/browse/STORM-297
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Reporter: Sean Zhong
> Labels: Performance, netty
> Fix For: 0.9.2-incubating
>
> Attachments: Storm_performance_fix.pdf,
> storm_Netty_receiver_diagram.png, storm_conf.txt,
> storm_performance_fix.patch, worker_throughput_without_storm-297.png
>
>
> We cannot scale up performance by adding more CPU cores and increasing
> parallelism.
> For a 2-layer topology Spout ---shuffle grouping--> Bolt, when the message
> size is small (around 100 bytes), we can see in the picture below that
> neither the CPU nor the network is saturated: only 40% of the CPU and only
> 18% of the network is used, even though parallelism is high (144 executors
> overall).
--
This message was sent by Atlassian JIRA
(v6.2#6252)