[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 8:44 AM:
----------------------------------------------------------------

Ok, I think I found something that might be the cause of the regression.

I’ll explain the details below, but to start with my conclusion: *with the 
commit in question we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to 
the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
as *cause-regression*. All the test results shown below are from running the 
benchmarks locally on my laptop.

First, I logged the number of network buffers and the page size.
||before-regression||cause-regression||
|12945 buffers, page size 32KB|32768 buffers, page size 32KB|

The result shows that cause-regression has more network buffers, as 
expected.

*Why do we have more network buffers?* For the two commits discussed, the network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Essentially, this method calculates the network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
 heapAndManagedMemory = jvmMaxHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmMaxHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming the same {{jvmMaxHeapMemory}}, changing managed memory from on-heap to 
off-heap gives us a larger network memory size, and thus a larger number of 
network buffers since the page size stays the same.
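
To illustrate the effect of the formula, here is a minimal standalone sketch (not the actual {{NettyShuffleEnvironmentConfiguration}} code); the heap and managed memory sizes are hypothetical, only the 32KB page size matches the setup above and the 0.1 network fraction is an assumed value.
{code:java}
// Minimal sketch of the quoted formula, NOT the actual Flink implementation.
// All sizes here are hypothetical and only serve to show why moving managed
// memory off-heap increases the computed network memory / buffer count.
public final class NetworkMemoryEstimate {

    // networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
    static long networkMemorySize(long heapAndManagedMemory, double networkFraction) {
        return (long) (heapAndManagedMemory * networkFraction / (1 - networkFraction));
    }

    public static void main(String[] args) {
        final long jvmMaxHeapMemory = 3L << 30; // hypothetical 3 GiB JVM heap
        final long managedMemory = 1L << 30;    // hypothetical 1 GiB managed memory
        final double networkFraction = 0.1;     // assumed network memory fraction
        final int pageSize = 32 * 1024;         // 32KB network buffer page

        // on-heap managed memory: already contained in the JVM heap
        long onHeap = networkMemorySize(jvmMaxHeapMemory, networkFraction);
        // off-heap managed memory: added on top of the JVM heap
        long offHeap = networkMemorySize(jvmMaxHeapMemory + managedMemory, networkFraction);

        System.out.println("on-heap managed:  " + onHeap / pageSize + " buffers");
        System.out.println("off-heap managed: " + offHeap / pageSize + " buffers");
    }
}
{code}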

What goes against intuition is that with more network buffers we actually get 
worse performance. The only explanation I can think of is that our benchmarks 
include the cluster initialization time in the statistics, and with more network 
buffers we need more time to allocate those direct memory buffers.
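
As a rough standalone illustration of that cost (this is not the {{NetworkBufferPool}} code path), the snippet below times the up-front allocation of 32KB direct buffers for the two buffer counts measured above; note that the large case allocates 1GiB of direct memory, so it needs a large enough {{-XX:MaxDirectMemorySize}}.
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Standalone illustration: pre-allocating more 32KB direct buffers takes
// proportionally longer, which is the suspected source of the difference in
// cluster initialization time. This is NOT the NetworkBufferPool code.
public final class DirectBufferAllocationTiming {

    static long allocationTimeMillis(int numBuffers, int pageSize) {
        List<ByteBuffer> buffers = new ArrayList<>(numBuffers);
        long start = System.nanoTime();
        for (int i = 0; i < numBuffers; i++) {
            buffers.add(ByteBuffer.allocateDirect(pageSize));
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final int pageSize = 32 * 1024;
        System.out.println("12945 buffers: " + allocationTimeMillis(12945, pageSize) + " ms");
        System.out.println("32768 buffers: " + allocationTimeMillis(32768, pageSize) + " ms");
    }
}
{code}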

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks, and 
logged the time consumed by allocating all the buffers in the constructor 
of {{NetworkBufferPool}}. Here are the results. The suffixes '-small' / '-large' 
stand for setting the number of network buffers to 12945 / 32768 respectively.
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|twoInputMapSink score (ops / ms)|11025.369|10471.976|11511.510|10749.657|
|globalWindow score (ops / ms)|4399.537|4063.925|4538.209|4020.957|
|buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that the larger number of network buffers indeed leads to 
the regression in the statistics. Digging further into the results, take 
{{arrayKeyBy}} in before-regression as an example: the total number of records per 
invocation is 7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}), and the 
score for the small network buffer count is 1934 ops / ms, so one invocation of 
{{KeyByBenchmarks#arrayKeyBy}} with the small network buffer count takes roughly 
7,000,000 / 1934 = 3619ms. Similarly, the total execution time with the large 
network buffer count is roughly 7,000,000 / 1835 = 3815ms. The difference between 
the small and large cases is about 196ms, which is very close to the difference in 
buffer allocation time (242 - 79 = 163ms). If we also take the randomness into 
consideration, this basically explains where the benchmark regression comes from.
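
For reference, the back-of-envelope calculation above can be written out as follows (it only re-derives the numbers already shown in the table, nothing new is measured here):
{code:java}
// Back-of-envelope check of the arrayKeyBy numbers from the table above
// (before-regression columns): per-invocation time estimated from the score,
// compared with the measured difference in buffer allocation time.
public final class RegressionEstimate {
    public static void main(String[] args) {
        final double recordsPerInvocation = 7_000_000; // KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION
        final double smallScore = 1934.100;            // ops / ms with 12945 buffers
        final double largeScore = 1835.320;            // ops / ms with 32768 buffers

        double smallTimeMs = recordsPerInvocation / smallScore; // ~3619 ms
        double largeTimeMs = recordsPerInvocation / largeScore; // ~3814 ms
        double slowdownMs = largeTimeMs - smallTimeMs;          // ~195 ms

        double allocationDeltaMs = 242.567 - 79.033;            // ~163 ms

        System.out.printf("invocation slowdown:   %.0f ms%n", slowdownMs);
        System.out.printf("extra allocation time: %.0f ms%n", allocationDeltaMs);
    }
}
{code}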

I still need to look into the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting 
my findings so far.


> Performance regression on 3.12.2019 in various benchmarks
> ---------------------------------------------------------
>
>                 Key: FLINK-15103
>                 URL: https://issues.apache.org/jira/browse/FLINK-15103
>             Project: Flink
>          Issue Type: Bug
>          Components: Benchmarks
>            Reporter: Piotr Nowojski
>            Priority: Blocker
>             Fix For: 1.10.0
>
>
> Various benchmarks show a performance regression that happened on December 
> 3rd:
> [arrayKeyBy (probably the most easily 
> visible)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=arrayKeyBy&env=2&revs=200&equid=off&quarts=on&extr=on]
>  
> [tupleKeyBy|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=tupleKeyBy&env=2&revs=200&equid=off&quarts=on&extr=on]
>  
> [twoInputMapSink|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=twoInputMapSink&env=2&revs=200&equid=off&quarts=on&extr=on]
>  [globalWindow (small 
> one)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=globalWindow&env=2&revs=200&equid=off&quarts=on&extr=on]
>  and possible others.
> Probably somewhere between those commits: -8403fd4- 2d67ee0..60b3f2f


