[ https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147739#comment-17147739 ]

Piotr Nowojski edited comment on FLINK-18433 at 6/29/20, 2:16 PM:
------------------------------------------------------------------

[~AHeise], this metric comes from 
{{org.apache.flink.runtime.io.network.api.writer.RecordWriter#numBuffersOut}} 
(a plain text search through the code might not find it, because its name is 
constructed at runtime: 
{{MetricNames#IO_NUM_BUFFERS_OUT_RATE = IO_NUM_BUFFERS_OUT + SUFFIX_RATE;}}).
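
For reference, a minimal sketch of how that name gets assembled. The constant 
values below are written from memory of {{MetricNames}} and should be verified 
against the Flink version you are running:
{code:java}
// Sketch only: mirrors how org.apache.flink.runtime.metrics.MetricNames
// composes the rate metric name; constant values may differ per Flink version.
public class MetricNameSketch {
    static final String IO_NUM_BUFFERS_OUT = "numBuffersOut";
    static final String SUFFIX_RATE = "PerSecond";

    // Assembled via concatenation, which is why a plain text search for
    // "numBuffersOutPerSecond" does not hit anything in the code base.
    static final String IO_NUM_BUFFERS_OUT_RATE = IO_NUM_BUFFERS_OUT + SUFFIX_RATE;

    public static void main(String[] args) {
        System.out.println(IO_NUM_BUFFERS_OUT_RATE); // -> numBuffersOutPerSecond
    }
}
{code}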

Meta note [~Aihua]: using the number of buffers per second as a measurement of 
throughput might be misleading. The actual record throughput might be the same 
or even better if, for whatever reason, we utilise each buffer more (i.e. the 
buffers we are sending are fuller). Could you and/or [~AHeise] verify the 
actual throughput and double check whether the number of buffers per second 
has indeed decreased (see the sketch below)? As [~AHeise] suggested, it might 
theoretically also be caused by some other measurement fault.

If that's the case, it might be interesting in its own right to investigate why 
the buffers are fuller. Maybe we have increased the overhead per buffer, but 
thanks to the low latency changes we are now sending more data per buffer 
(increasing latency a bit). 
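
To make that check concrete, here is a small illustrative sketch (not Flink 
code; the input numbers are made up and would come from the 
{{numRecordsOutPerSecond}}, {{numBytesOutPerSecond}} and 
{{numBuffersOutPerSecond}} metrics of both runs) showing how fewer buffers per 
second can coexist with equal or better record throughput when buffers are 
fuller:
{code:java}
// Illustrative arithmetic only: replace the hypothetical numbers with the
// metric values observed for the 1.10 and 1.11 runs.
public class ThroughputCheck {
    public static void main(String[] args) {
        // Hypothetical measurements.
        double buffers110 = 1000, bytes110 = 16_000_000, records110 = 500_000;
        double buffers111 =  900, bytes111 = 16_200_000, records111 = 505_000;

        // Average payload per buffer: if this went up, a drop in buffers/s is
        // expected and not by itself a sign of a throughput regression.
        System.out.printf("avg bytes/buffer: 1.10=%.0f, 1.11=%.0f%n",
                bytes110 / buffers110, bytes111 / buffers111);

        // The numbers that actually decide whether there is a regression.
        System.out.printf("records/s: 1.10=%.0f, 1.11=%.0f%n", records110, records111);
    }
}
{code}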


> From the end-to-end performance test results, 1.11 has a regression
> -------------------------------------------------------------------
>
>                 Key: FLINK-18433
>                 URL: https://issues.apache.org/jira/browse/FLINK-18433
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core, API / DataStream
>    Affects Versions: 1.11.0
>         Environment: 3 machines
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
>            Reporter: Aihua Li
>            Priority: Major
>         Attachments: flink_11.log.gz
>
>
>  
> I ran end-to-end performance tests between release-1.10 and release-1.11. 
> The results were as follows (the last column is the relative change of 
> release-1.11 vs. release-1.10):
> |scenarioName|release-1.10|release-1.11|change|
> |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.81333333|-5.11%|
> |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%|
> |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.323333|-5.97%|
> |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%|
> |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.6883333|-5.85%|
> |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.123333|-8.81%|
> |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.91666667|43.09833333|-6.14%|
> |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.7266667|-5.35%|
> |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%|
> |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.55166667|-5.57%|
> |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.0383333|200.3883333|-5.49%|
> |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%|
> |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.05833333|43.49666667|-5.56%|
> |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.2333333|201.1883333|-5.20%|
> |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.663333|1616.85|-6.03%|
> |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.62333333|-5.45%|
> |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.9183333|152.9566667|-2.52%|
> |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%|
> |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.29666667|34.16666667|-0.38%|
> |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.3533333|151.8483333|-4.11%|
> |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%|
> |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.57166667|32.09666667|-7.16%|
> |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%|
> |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%|
> |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%|
> |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.5883333|145.9966667|-2.40%|
> |TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%|
> |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.68333333|-12.76%|
> |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.3033333|151.4616667|-3.71%|
> |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%|
> |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%|
> |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%|
> It can be seen that the performance of 1.11 has a regression, mostly around 
> 5%, with a maximum regression of about 17%. This needs to be checked.
> The test code:
> flink-1.10.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> flink-1.11.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> The submit command looks like this:
> bin/flink run -d -m 192.168.39.246:8081 -c 
> org.apache.flink.basic.operations.PerformanceTestJob 
> /home/admin/flink-basic-operations_2.11-1.10-SNAPSHOT.jar --topologyName 
> OneInput --LogicalAttributesofEdges Broadcast --ScheduleMode LazyFromSource 
> --CheckpointMode ExactlyOnce --recordSize 10 --stateBackend rocksdb
>  


