Github user clockfly commented on the pull request:
https://github.com/apache/incubator-storm/pull/103#issuecomment-45052822
Hi @Gvain,
First, thank you for your persistence! It really helps us gain a deeper
understanding of Storm.
I reran the test with your approach, and the results confirm your claim:
the throughput does increase as the worker count increases.
Test
-------------
**Throughput vs. worker# †**

| Worker# | cluster network IN(MB/s) | spout throughput(msg/s) | overall CPU | user cpu | system cpu | total threads# ‡ |
| ------- |:------------------------:| -----------------------:| -----------:| --------:| ----------:| ----------------:|
| 4 | 72 | 391860 | 42% | 36% | 6%| 422 |
| 8 | 88.3 | 475007 | 42% | 35% | 7% | 554 |
| 12 | 104 | 555174 | 51% | 40% | 11% | 686 |
| 16 | 116 | 603399 | 59% | 46% | 13% | 818 |
| 24 | 130 | 622938 | 77% | 55% | 22% | 1082 |
| 4 (with STORM-297) | 140 | 752479 | 74% | 67.8% | 5.2% | 434 |
† Test environment: 4 nodes, 48 vcores on each machine (CPU: E5-2680),
max.spout.pending = 1000, 48 spouts, 48 bolts, and 48 ackers.
‡ We count only 1 GC thread and 1 JIT thread for each JVM.
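
For reference, here is a minimal sketch of how a topology with these
settings could be wired up. The spout comes from Storm's bundled testing
utilities; `NoOpBolt` and the topology name are hypothetical placeholders,
while the `Config` calls are the standard API for the settings above:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class ThroughputTest {

    // Hypothetical sink bolt: BaseBasicBolt acks each tuple automatically,
    // and we emit nothing downstream.
    public static class NoOpBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // intentionally empty: we only measure spout throughput and acking
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output fields
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new TestWordSpout(), 48);                   // 48 spouts
        builder.setBolt("bolt", new NoOpBolt(), 48).shuffleGrouping("spout"); // 48 bolts

        Config conf = new Config();
        conf.setNumWorkers(4);         // varied across runs: 4, 8, 12, 16, 24
        conf.setNumAckers(48);         // 48 ackers
        conf.setMaxSpoutPending(1000); // max.spout.pending = 1000

        StormSubmitter.submitTopology("throughput-test", conf, builder.createTopology());
    }
}
```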

**CPU Usage**

We can see that both throughput and CPU usage increase as the worker count
increases.
Analysis
-------------------
The facts revealed by this test strengthen my conviction that applying
this patch is the better way to get higher performance:
1. **Higher performance with this patch.**
| | Worker# | cluster network IN(MB/s) | spout throughput(msg/s) |
| ------------- |:-------:| ------------------------:| -----------------------:|
| no STORM-297 | 24 | 130 | 622938 |
| with STORM-297 | 4 | 140 | 752479 |
**4 workers** with STORM-297 can process **20% more** messages than **24
workers** without this patch, with **less** CPU consumption.
2. **Storm cannot scale well by raising task parallelism alone.**
As the data in your test showed, with 4 workers we can only reach 56% CPU,
even though each worker runs 36 tasks, far more than the 24 CPU cores.
**The worker has inherent bottlenecks; the issues are real, and
work-arounds won't fix them.**
3. **CPU system time increases as the worker count increases.**
| Worker# |user cpu|system cpu|
| ------------- | -----:|-----:|
| 4 | 36% | 6%|
| 8 | 35% | 7% |
| 12 | 40% | 11% |
| 16 | 46% | 13% |
| 24 | 55% | **22%** |
| 4 (with STORM-297) | 67.8% | 5.2% |
Abnormally high system CPU usage is not good: it means more time is spent
in the kernel rather than doing useful work.
4. **JVM allocation cost is high.**
In the test, each allocated JVM runs 16+ internal threads. The supporting
data can be found at
5. **Worker allocation is not cost-free.**
Besides the JVM allocation cost, each worker carries ZooKeeper-related
threads (4), heartbeat threads (5), the system bolt (2), and Netty boss and
worker threads (6: 2 boss, 2 worker, and 2 timer). Together with the JVM
threads, each worker adds at least 33 threads (see the fit sketched after
this list).
More threads put more pressure on central services like Nimbus and
ZooKeeper.
More threads also mean more context switches, which hurts the performance
of all applications running on these servers.
| Worker# |cluster total threads#|
| ------------- |:-------------:|
| 4 | 422 |
| 8 | 554 |
| 12 | 686 |
| 16 | 818 |
| 24 | **1082** |
| 4 (with STORM-297) | 434 |
(We count only 1 GC thread and 1 JIT thread for each JVM.)
In the 24-worker case, the overall thread count is 1082+, with **each
machine carrying more than 270 threads!**
6. **Serialization and deserialization cost.**
When a message is delivered to a task in the same process, the tuple is
not serialized.
With 4 workers (assuming shuffle grouping and evenly distributed tasks),
1/4 of the messages need no serialization; with 24 workers that ratio drops
to 1/24, which means we now have to serialize roughly 28% more messages
(see the worked arithmetic after this list).
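
To make the 28% figure in item 6 concrete, here is the arithmetic under
the stated even-distribution assumption: with W workers, the fraction of
messages that must cross process boundaries, and therefore be serialized,
is (W - 1)/W.

```
serialized fraction = (W - 1) / W
W = 4   ->  3/4   of messages serialized
W = 24  ->  23/24 of messages serialized
(23/24) / (3/4) = 23/18 ≈ 1.28  ->  ~28% more messages to serialize
```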
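
Likewise, the per-worker thread overhead in item 5 can be read straight
off the threads table: a back-of-the-envelope linear fit (mine, using only
the numbers above) reproduces every measured row exactly.

```
total threads ≈ 290 + 33 × worker#
check: 290 + 33 × 4  = 422     290 + 33 × 8  = 554
       290 + 33 × 16 = 818     290 + 33 × 24 = 1082
```

The slope confirms the "at least 33 threads per worker" estimate; the
~290-thread intercept is presumably the cluster-wide baseline that exists
regardless of worker count.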