[
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992220#comment-16992220
]
Xintong Song edited comment on FLINK-15103 at 12/10/19 8:44 AM:
----------------------------------------------------------------
Ok, I think I found something that might be the cause of the regression.
I’ll explain the details below, but let me start with my conclusion: *With the
commit in question we actually have more network buffers than before, which take
more time to allocate when initializing the task executors, causing the
regression.*
For simplicity, I’ll refer to the last commit before the regression
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to
the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b)
as *cause-regression*. All the test results shown below are from running the
benchmarks locally on my laptop.
First, I logged the number of network buffers and the page size.
||before-regression||cause-regression||
|12945 buffers, page size 32KB|32768 buffers, page size 32KB|
The result shows that cause-regression has more network buffers, which is as
expected.
*Why do we have more network buffers?* For the two commits discussed, the network
buffer memory size is calculated in
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}.
Basically, this method calculates the network memory size with the following
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
heapAndManagedMemory = jvmMaxHeapMemory, if managed memory is on-heap
heapAndManagedMemory = jvmMaxHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming we have the same {{jvmMaxHeapMemory}}, changing managed memory from
on-heap to off-heap gives us a larger network memory size, and thus a larger
number of network buffers when the page size stays the same.
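To make the effect concrete, here is a minimal, self-contained sketch of that formula. This is not the actual {{NettyShuffleEnvironmentConfiguration}} code, and the heap size, managed memory size and network fraction below are assumptions chosen purely for illustration:
{code:java}
/**
 * Minimal sketch of the formula above, not the actual Flink code.
 * The concrete numbers (1 GB heap, 640 MB managed memory, 0.1 network
 * fraction, 32 KB pages) are assumptions chosen only to illustrate why
 * moving managed memory off-heap increases the number of network buffers.
 */
public class NetworkBufferCountSketch {

    // networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
    static long networkMemorySize(long heapAndManagedMemory, double networkFraction) {
        return (long) (heapAndManagedMemory * networkFraction / (1 - networkFraction));
    }

    public static void main(String[] args) {
        long jvmMaxHeapMemory = 1024L << 20; // assumed 1 GB JVM max heap
        long managedMemory = 640L << 20;     // assumed 640 MB managed memory
        double networkFraction = 0.1;        // assumed network fraction
        long pageSize = 32L << 10;           // 32 KB pages, as seen in the logs above

        // On-heap managed memory: it is already contained in the JVM heap.
        long onHeap = networkMemorySize(jvmMaxHeapMemory, networkFraction);
        // Off-heap managed memory: it is added on top of the JVM heap.
        long offHeap = networkMemorySize(jvmMaxHeapMemory + managedMemory, networkFraction);

        System.out.println("on-heap managed memory:  " + onHeap / pageSize + " network buffers");
        System.out.println("off-heap managed memory: " + offHeap / pageSize + " network buffers");
    }
}
{code}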
What goes against intuition is that with more network buffers we actually see
worse performance. The only explanation I can think of is that our benchmarks
include the cluster initialization time in the statistics, and with more network
buffers we need more time to allocate those direct memory buffers.
To verify that, I explicitly configured
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks and
logged the time consumed by allocating all the buffers in the constructor
of {{NetworkBufferPool}}. The suffixes '-small' / '-large' stand for setting
the number of network buffers to 12945 / 32768 respectively; the results are
in the table below.
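(For reference, pinning the buffer count looks roughly like the following; this is an illustrative snippet, not the exact change made to the benchmarks.)
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.NettyShuffleEnvironmentOptions;

/**
 * Illustrative snippet, not the exact benchmark change: pin the number of
 * network buffers so that both commits allocate the same amount of network
 * memory. The benchmark would then pass this configuration to the (mini)
 * cluster it starts.
 */
public class FixedNetworkBufferCount {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // 12945 corresponds to the '-small' setting below, 32768 to '-large'.
        config.setInteger(NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS, 12945);
        System.out.println(config);
    }
}
{code}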
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|twoInputMapSink score (ops / ms)|11025.369|10471.976|11511.51|10749.657|
|globalWindow score (ops / ms)|4399.537|4063.925|4538.209|4020.957|
|buffer allocation time (ms)|79.033|242.567|78.167|237.717|
The benchmark scores show that a larger number of network buffers indeed leads to
the regression in the statistics. Digging further into the results, take
{{arrayKeyBy}} in before-regression as an example: the total number of records
per invocation is 7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}),
and the score for the small network memory case is 1934 ops / ms, so the total
execution time of one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small
network buffer count is roughly 7,000,000 / 1934 = 3619ms. Similarly, the total
execution time with the large network buffer count is roughly 7,000,000
/ 1835 = 3815ms. The difference in total execution time between the small and
large network buffer cases is about 196ms, which is very close to the difference
in buffer allocation time (242 - 79 = 163ms). If we also take randomness into
consideration, this basically explains where the benchmark regression comes from.
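As a rough, standalone sanity check that allocating tens of thousands of 32KB direct buffers really does cost on the order of a couple hundred milliseconds, one can time plain direct allocations outside of Flink. This is not the {{NetworkBufferPool}} itself, just an approximation of what its constructor does:
{code:java}
import java.nio.ByteBuffer;

/**
 * Standalone approximation of what the NetworkBufferPool constructor does:
 * allocate N direct buffers of the configured page size up front. Not Flink
 * code; absolute timings will differ per machine and JVM settings. The
 * 'large' case allocates 1 GB of direct memory, so you may need to raise
 * -XX:MaxDirectMemorySize to run it.
 */
public class DirectBufferAllocationTiming {

    static long allocationTimeMillis(int numBuffers, int pageSize) {
        ByteBuffer[] buffers = new ByteBuffer[numBuffers]; // keep references so nothing is freed early
        long start = System.nanoTime();
        for (int i = 0; i < numBuffers; i++) {
            buffers[i] = ByteBuffer.allocateDirect(pageSize);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int pageSize = 32 * 1024; // 32 KB, as in the logs above
        System.out.println("small (12945 buffers): " + allocationTimeMillis(12945, pageSize) + " ms");
        System.out.println("large (32768 buffers): " + allocationTimeMillis(32768, pageSize) + " ms");
    }
}
{code}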
I still need to look into the performance of the later commit
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting
my findings so far.
> Performance regression on 3.12.2019 in various benchmarks
> ---------------------------------------------------------
>
> Key: FLINK-15103
> URL: https://issues.apache.org/jira/browse/FLINK-15103
> Project: Flink
> Issue Type: Bug
> Components: Benchmarks
> Reporter: Piotr Nowojski
> Priority: Blocker
> Fix For: 1.10.0
>
>
> Various benchmarks show a performance regression that happened on December
> 3rd:
> [arrayKeyBy (probably the most easily
> visible)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=arrayKeyBy&env=2&revs=200&equid=off&quarts=on&extr=on]
>
> [tupleKeyBy|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=tupleKeyBy&env=2&revs=200&equid=off&quarts=on&extr=on]
>
> [twoInputMapSink|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=twoInputMapSink&env=2&revs=200&equid=off&quarts=on&extr=on]
> [globalWindow (small
> one)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=globalWindow&env=2&revs=200&equid=off&quarts=on&extr=on]
> and possibly others.
> Probably somewhere between those commits: -8403fd4- 2d67ee0..60b3f2f