GitHub user srdo opened a pull request:
https://github.com/apache/storm/pull/2704
STORM-1038: Upgrade to Netty 4
This is a continuation of the work done at
https://github.com/apache/storm/pull/728.
### Important changes:
* Split MessageEncoder into multiple encoders depending on the message
type, and made the stateless ones singletons (see the first sketch after this list)
* Moved worker context termination to after the transfer thread has shut
down. Shutting the context down first looks racy to me, since the transfer
thread uses it.
* ChannelGroups automatically remove closed Channels, so
Server/PacemakerServer no longer remove closed channels manually (see the
second sketch after this list)
* Made some changes to TransferDrainer for readability. I didn't see any
performance impact from them, and they aren't necessary for the Netty 4
upgrade. Let me know if they should go in a separate PR.
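
For reference, here's a minimal sketch of what the encoder split looks like in
Netty 4 terms. The names (`ControlMessageEncoder`, `ControlMessage`, `code()`)
are hypothetical, not the actual classes in this PR; the point is that a
stateless encoder can be marked `@Sharable` and reused as a singleton across
all channel pipelines:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandler.Sharable;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToByteEncoder;

interface ControlMessage {
    short code(); // hypothetical accessor for the wire-level message code
}

// A stateless encoder: no per-channel fields, so a single @Sharable instance
// can be installed in every pipeline instead of one instance per connection.
@Sharable
final class ControlMessageEncoder extends MessageToByteEncoder<ControlMessage> {
    static final ControlMessageEncoder INSTANCE = new ControlMessageEncoder();

    private ControlMessageEncoder() {
    }

    @Override
    protected void encode(ChannelHandlerContext ctx, ControlMessage msg, ByteBuf out) {
        // Stateless encoding: just write the message's wire form into the
        // outbound buffer.
        out.writeShort(msg.code());
    }
}
```

And a sketch of the ChannelGroup behavior the third bullet relies on: Netty's
`DefaultChannelGroup` registers a close listener on every added channel and
drops the channel from the group when it closes. The class and field names
here are again just illustrative:

```java
import io.netty.channel.Channel;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.util.concurrent.GlobalEventExecutor;

final class ChannelTracking {
    // Closed channels are removed from the group automatically, so there is
    // no need for the manual pruning Server/PacemakerServer used to do.
    static final ChannelGroup ALL_CHANNELS =
        new DefaultChannelGroup("storm-server", GlobalEventExecutor.INSTANCE);

    static void register(Channel channel) {
        ALL_CHANNELS.add(channel);
    }
}
```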
# Benchmarking
I benchmarked on a single-node install. Results are accessible at
https://drive.google.com/open?id=1OuYHusyshQbzx38q5jchj8W98WG6An0L.
## Speed of light
Ran `./storm jar .\storm-perf-2.0.0-SNAPSHOT.jar
org.apache.storm.perf.ConstSpoutIdBoltNullBoltTopo 600
.\ConstSpoutIdBoltNullBoltTopo.yaml` with the following config
```
spout.count : 1
bolt1.count : 1 # IdBolt instances
bolt2.count : 1 # DevNullBolt instances
topology.workers : 3
```
This should force all messages to cross a worker boundary, which should
expose any throughput regression in the changed code.
## TVL
I didn't put much thought into the number of workers or the component
counts here; let me know if I should rerun with other numbers.
Ran `./storm jar .\storm-loadgen-2.0.0-SNAPSHOT.jar
org.apache.storm.loadgen.ThroughputVsLatency --rate 90000 --spouts 4
--splitters 4 --counters 4 --reporter 'tsv:test.csv' -c topology.workers=4
--test-time 10`.
90k tuples/sec is more than my machine can handle, so this run mostly
demonstrates that Storm doesn't choke any worse than it did before.
Also ran `./storm jar .\storm-loadgen-2.0.0-SNAPSHOT.jar
org.apache.storm.loadgen.ThroughputVsLatency --rate 75000 --spouts 4
--splitters 4 --counters 4 --reporter 'tsv:test.csv' -c topology.workers=4
--test-time 10`.
75k tuples/sec is right at the edge of what my machine can push through, so
this run shows what the latencies look like compared to master.
Performance looks pretty much the same to me before and after these changes.
Let me know if there are other tests I should run to validate these changes.
# To do at some point
* ByteBufs are reference counted, so we should validate that we don't leak
references when testing. Netty can log when a leak occurs
(https://netty.io/wiki/reference-counted-objects.html#leak-detection-levels),
but I haven't set anything up yet to fail the build if those logs appear
during testing. I did run some tests with the detection level set to paranoid
and didn't see any leaks. See the sketch after this list.
* Didn't really test Pacemaker
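
For the leak-detection item above, a rough sketch of the relevant pieces.
`ResourceLeakDetector.setLevel` and `ReferenceCountUtil.release` are real
Netty 4 APIs; the surrounding class and method names are hypothetical, and
wiring the "LEAK:" log lines into a build failure is still to be done:

```java
import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;
import io.netty.util.ResourceLeakDetector;

final class LeakDetectionExample {
    static void enableParanoidLeakDetection() {
        // Equivalent to running the JVM with -Dio.netty.leakDetectionLevel=paranoid.
        // Netty then logs a "LEAK:" message when a ByteBuf is garbage collected
        // without having been released.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }

    static void consume(ByteBuf buf) {
        try {
            // ... read the message payload from buf ...
        } finally {
            // Decrement the reference count; the buffer is freed when it hits 0.
            ReferenceCountUtil.release(buf);
        }
    }
}
```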
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/srdo/storm STORM-1038-clean
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/2704.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2704
----
commit 20db8c339cbabefab1fe4cccb74d68c87cc36264
Author: Hang Sun <hsun@...>
Date: 2018-05-11T13:57:47Z
STORM-1038: Upgrade to Netty 4.x. See
https://github.com/apache/storm/pull/728 for the original contribution.
commit 2d0f96102e20ca1c89fb65d05a860456c9d10a96
Author: Stig Rohde Døssing <srdo@...>
Date: 2018-05-11T14:55:35Z
STORM-1038: Upgrade to latest Netty
commit 5fbe93339061c539ae20e06c7158f892ba4abb3c
Author: Stig Rohde Døssing <srdo@...>
Date: 2018-06-05T14:32:06Z
Refactor of transfer drainer
----