[
https://issues.apache.org/jira/browse/STORM-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018825#comment-14018825
]
ASF GitHub Bot commented on STORM-297:
--------------------------------------
Github user clockfly commented on the pull request:
https://github.com/apache/incubator-storm/pull/103#issuecomment-45226047
@miguno, I have several more observations:
1. Network still not efficient enough
We can see from the test report that, even after this fix, throughput is still
bottlenecked by the network (CPU: 72%, network: 45%), since the CPU still has
28% headroom. That is strange, because only 45% of the theoretical network
bandwidth is being used.
2. Uneven message receive latency across machines
In the experiment, I noticed that there are always some machines whose
message receive latency is much higher than the others'. For example, tuples
generated on machine A are sent to tasks on machines B, C, and D; in one run
the tasks on B take more time to receive messages, while in another run D may
be the slowest.
My guess is that some machines end up with a longer Netty receiver queue than
the other machines, and after some time the queue length on each machine
becomes stable but unequal (new input = new output). The latency differs
because the queue lengths differ. Changing max.spout.pending will not improve
this, because it only controls the overall number of messages sent from A; it
does not treat B, C, and D differently.
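To make the queue-length argument concrete: in steady state, Little's law ties
the waiting time in a receiver queue to its standing length (a general queuing
identity, not a number measured in this test):

    L_i = \lambda \cdot W_i  \Rightarrow  W_i = L_i / \lambda

With the same per-destination send rate \lambda from A, the machine whose
Netty receiver queue stabilizes at a larger L_i sees a proportionally larger
receive latency W_i, which is consistent with the guess above.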
3. A better max.spout.pending?
I observed that once max.spout.pending is tuned to a large enough value,
increasing it further only adds latency, not throughput: when
max.spout.pending doubles, the latency doubles.
Could we do flow control adaptively, so that we stop increasing
max.spout.pending once there is no further benefit?
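A rough sketch of the kind of adaptive control I mean (the class and method
names are made up for illustration; this is not Storm code): keep growing the
effective pending cap only while throughput still improves, and back off once
extra pending tuples only add latency.

    /**
     * Hypothetical controller for an adaptive max.spout.pending cap.
     * Grow the cap while measured throughput keeps improving; shrink it
     * once a larger cap no longer helps (extra pending = extra latency).
     * Thresholds and step sizes are illustrative only.
     */
    public class PendingController {
        private int cap = 1000;            // current effective pending cap
        private double lastThroughput = 0; // tuples/sec in previous window

        /** Called once per measurement window with the observed throughput. */
        public int onWindow(double throughput) {
            if (throughput > lastThroughput * 1.05) {
                cap = (int) (cap * 1.5);      // still gaining: probe upward
            } else {
                cap = Math.max(100, cap / 2); // no gain: back off
            }
            lastThroughput = throughput;
            return cap;
        }
    }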
4. Potential deadlock when all intermediate buffers are full
Consider two workers: task1 (worker A) delivers messages to task3 (worker B),
and task3 delivers to task2 (worker A). There is a loop! It is possible for
all of the worker sender/receiver buffers to fill up and block.

The current workaround in Storm is tricky: it uses an unbounded receiver
buffer (LinkedBlockingQueue) for each worker to break the loop. But this is
not good, because the receiver buffer can potentially grow very long, and the
latency can become very high.
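To spell out the trade-off in code (a simplified illustration, not the actual
worker internals): a bounded buffer blocks the producer when it is full, which
is exactly what lets the task1 -> task3 -> task2 cycle wedge, while the
unbounded workaround never blocks but puts no cap on queue length or latency.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ReceiverBuffers {
        // Bounded: gives back-pressure, but put() blocks when full. With the
        // cross-worker cycle above, every sender can end up blocked on a full
        // buffer at the same time (the deadlock described in this point).
        static final BlockingQueue<byte[]> bounded =
                new ArrayBlockingQueue<>(1024);

        // Unbounded (the current workaround): put() never blocks, so the cycle
        // cannot wedge, but queue length and therefore latency are uncapped
        // under sustained overload.
        static final BlockingQueue<byte[]> unbounded =
                new LinkedBlockingQueue<>();
    }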
5. Is it necessary for each task to have a dedicated send queue thread?
Currently, each task has a dedicated send queue thread that pushes data to
the worker transfer queue. During profiling, the task send queue thread is
usually in the WAITING state. Maybe it would be a good idea to replace the
dedicated threads with a shared thread pool?
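Roughly what I have in mind, sketched with simplified types standing in for
the per-task send queues and the worker transfer queue: a small shared pool
drains every task's send queue instead of one mostly idle thread per task.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SharedSender {
        /**
         * Sketch: poolSize shared threads poll all per-task send queues
         * round-robin and forward tuples to the worker transfer queue,
         * replacing one dedicated (and mostly WAITING) thread per task.
         */
        static void start(List<BlockingQueue<Object>> taskSendQueues,
                          BlockingQueue<Object> workerTransferQueue,
                          int poolSize) {
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        for (BlockingQueue<Object> q : taskSendQueues) {
                            Object tuple = q.poll(1, TimeUnit.MILLISECONDS);
                            if (tuple != null) {
                                workerTransferQueue.put(tuple);
                            }
                        }
                    }
                    return null;
                });
            }
        }
    }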
6. Acker workload is very high
In the test, I noticed that the acker task is very busy. Since each message
is small (100 bytes), there is a huge number of tuples that need to be acked.
Can this acker cost be reduced?
For example, we could group tuples at the spout into time slices, with every
tuple in a slice sharing the same root tuple id. If the time slice is 100 ms
and there are 10,000 messages in the slice, they all share the same root id;
before sending to the acker task, we could first XOR all ack values for the
same root id locally on each worker. In that case, we might reduce both the
acking network traffic and the acker task cost. The drawback is that when a
message is lost, we need to replay all messages in that slice.
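A sketch of the local pre-aggregation idea (a hypothetical helper, not
existing Storm code): each worker XORs the ack values of every tuple that
shares a root id and sends only one combined ack per root id when the slice
is flushed.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Hypothetical per-worker ack pre-aggregation for the time-slice idea:
     * all tuples emitted in the same slice share one root id, the worker
     * XORs their ack values locally, and a single combined ack per root id
     * goes to the acker instead of one ack per tuple.
     */
    public class LocalAckAggregator {
        private final Map<Long, Long> xorByRootId = new ConcurrentHashMap<>();

        /** Record an ack locally; nothing is sent over the network here. */
        public void ack(long rootId, long ackValue) {
            xorByRootId.merge(rootId, ackValue, (a, b) -> a ^ b);
        }

        /** Drain one combined ack per root id for the acker (send not shown). */
        public Map<Long, Long> flush() {
            // Sketch only: a real version must handle acks racing with flush().
            Map<Long, Long> out = new HashMap<>(xorByRootId);
            xorByRootId.clear();
            return out;
        }
    }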
7. Worker receive thread blocked by a task receiver queue
The worker receive thread tries to publish messages to the receive queue of
each task sequentially, in a blocking way. If one task's receiver queue is
full, the thread blocks and waits, stalling delivery to the other tasks as
well.
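For comparison, an illustrative non-blocking variant (a sketch, not a proposed
patch): the receive thread offers to each task's queue and parks tuples in a
small per-task backlog when that queue is full, so one slow task does not
stall delivery to all the others. The parked backlog then raises the same
bounded-versus-unbounded question as point 4.

    import java.util.Deque;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    /**
     * Illustrative alternative to the blocking sequential publish: offer() to
     * the target task's receive queue and keep a per-task backlog for the
     * queues that are currently full.
     */
    public class NonBlockingDispatch {
        static void dispatch(List<BlockingQueue<Object>> taskQueues,
                             List<Deque<Object>> backlogs,  // one per task
                             int taskIndex, Object tuple) {
            BlockingQueue<Object> q = taskQueues.get(taskIndex);
            Deque<Object> backlog = backlogs.get(taskIndex);
            // Drain any earlier backlog first so per-task ordering is kept.
            while (!backlog.isEmpty() && q.offer(backlog.peekFirst())) {
                backlog.pollFirst();
            }
            if (!backlog.isEmpty() || !q.offer(tuple)) {
                backlog.addLast(tuple); // queue full: park instead of blocking
            }
        }
    }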
> Storm Performance cannot be scaled up by adding more CPU cores
> --------------------------------------------------------------
>
> Key: STORM-297
> URL: https://issues.apache.org/jira/browse/STORM-297
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Reporter: Sean Zhong
> Labels: Performance, netty
> Fix For: 0.9.2-incubating
>
> Attachments: Storm_performance_fix.pdf,
> storm_Netty_receiver_diagram.png, storm_conf.txt,
> storm_performance_fix.patch, worker_throughput_without_storm-297.png
>
>
> We cannot scale up performance by adding more CPU cores and increasing
> parallelism.
> For a 2-layer topology Spout ---shuffle grouping--> Bolt, when the message
> size is small (around 100 bytes), we can see in the picture below that
> neither the CPU nor the network is saturated: only 40% of the CPU and only
> 18% of the network is used, even though parallelism is high (144 executors
> overall).
--
This message was sent by Atlassian JIRA
(v6.2#6252)