Hi Leonard and Yun,

Thanks to Yun Gao for the reminder.

We quickly re-ran the q18 test in Nexmark for Flink 1.16.1 and Flink 1.17
with the configuration
`execution.checkpointing.checkpoints-after-tasks-finish.enabled: false`.
The last few CPU usage metrics behaved normally, which confirms that the
parameter works.
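
In case it helps anyone reproduce this, here is a minimal sketch (my own
example, not part of the Nexmark tooling) of setting the same flag
programmatically instead of via flink-conf.yaml; the class name and the
placeholder source are only there to make the snippet self-contained:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FinalCheckpointDisabledExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as the flink-conf.yaml entry quoted above:
        // skip the final checkpoint after bounded tasks finish.
        conf.setString(
                "execution.checkpointing.checkpoints-after-tasks-finish.enabled", "false");
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // Placeholder pipeline; the real benchmark runs the Nexmark q18 job instead.
        env.fromSequence(1, 10).print();
        env.execute("final-checkpoint-disabled-example");
    }
}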

With the configuration, we observed that:
> The test results of Flink 1.17 improved (Throughput/Cores: 58.85 K/s -> 59.27 K/s).
> Compared with Flink 1.16.1, Flink 1.17 also performs better (Throughput/Cores: 57.19 K/s vs. 59.27 K/s).
> But compared with Flink 1.13, Flink 1.17 still shows a performance degradation (64.31 K/s -> 59.27 K/s).

Note that we are comparing against Flink 1.13 because we use 1.13 as the
major version in our production environment.
We'll now move on to complete the rest of the Nexmark test queries.
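
For reference, the relative changes implied by the numbers above work out
to roughly (my own arithmetic, not part of the benchmark output):
  vs. Flink 1.13:   (64.31 - 59.27) / 64.31 ≈ 7.8% lower Throughput/Cores
  vs. Flink 1.16.1: (59.27 - 57.19) / 57.19 ≈ 3.6% higher Throughput/Cores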


Here are some test results for your reference.

Flink 1.17
Benchmark Queries: [q18]
==================================================================
Start to run query q18 with workload [tps=10 M, eventsNum=100 M,
percentage=bid:46,auction:3,person:1,kafkaServers:null]
Start the warmup for at most 120000ms and 100000000 events.
Stop the warmup, cost 120100ms.
Monitor metrics after 10 seconds.
Start to monitor metrics until job is finished.
Current Cores=18.62 (8 TMs)
Current Cores=16.42 (8 TMs)
Current Cores=12.49 (8 TMs)
Current Cores=12.92 (8 TMs)
Current Cores=13.59 (8 TMs)
Current Cores=14.88 (8 TMs)
Current Cores=13.46 (8 TMs)
Current Cores=12.99 (8 TMs)
Current Cores=11.99 (8 TMs)
Current Cores=16.27 (8 TMs)
Current Cores=15.43 (8 TMs)
Current Cores=14.43 (8 TMs)
Current Cores=16 (8 TMs)
Current Cores=11.78 (8 TMs)
Current Cores=12.97 (8 TMs)
Current Cores=11.17 (8 TMs)
Current Cores=15.51 (8 TMs)
Current Cores=15.85 (8 TMs)
Current Cores=13.81 (8 TMs)
Current Cores=15.26 (8 TMs)
Current Cores=13.52 (8 TMs)
Current Cores=12.48 (8 TMs)
Summary Average: EventsNum=100,000,000, Cores=14.17, Time=119.027 s
Stop job query q18
-------------------------------- Nexmark Results --------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |14.17              |119.027            |1687.157           |59.27 K/s          |
|Total              |100,000,000        |14.175             |119.027            |1687.157           |59.27 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
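
(For readers of the table: Throughput/Cores is simply Events Num divided by
Cores * Time(s), e.g. 100,000,000 / 1687.157 ≈ 59.27 K events per core-second.)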

Flink 1.16.1
Benchmark Queries: [q18]
==================================================================
Start to run query q18 with workload [tps=10 M, eventsNum=100 M,
percentage=bid:46,auction:3,person:1,kafkaServers:null]
Start the warmup for at most 120000ms and 100000000 events.
Stop the warmup, cost 120100ms.
Monitor metrics after 10 seconds.
Start to monitor metrics until job is finished.
Current Cores=16.34 (8 TMs)
Current Cores=17.97 (8 TMs)
Current Cores=15.22 (8 TMs)
Current Cores=14.3 (8 TMs)
Current Cores=15.79 (8 TMs)
Current Cores=17.48 (8 TMs)
Current Cores=13.9 (8 TMs)
Current Cores=12.49 (8 TMs)
Current Cores=13.98 (8 TMs)
Current Cores=15.27 (8 TMs)
Current Cores=15.8 (8 TMs)
Current Cores=15.46 (8 TMs)
Current Cores=15.53 (8 TMs)
Current Cores=12.4 (8 TMs)
Current Cores=12.91 (8 TMs)
Current Cores=11.59 (8 TMs)
Current Cores=8.06 (8 TMs)
Current Cores=11.9 (8 TMs)
Current Cores=14.88 (8 TMs)
Current Cores=15.19 (8 TMs)
Current Cores=15.11 (8 TMs)
Current Cores=12.59 (8 TMs)
Current Cores=10.83 (8 TMs)
Summary Average: EventsNum=100,000,000, Cores=14.13, Time=123.766 s
Stop job query q18
-------------------------------- Nexmark Results --------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |14.13              |123.766            |1748.705           |57.19 K/s          |
|Total              |100,000,000        |14.129             |123.766            |1748.705           |57.19 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+

Thanks.

Best
Yu Chen.

Yun Gao <yungao...@aliyun.com.invalid> wrote on Wed, Mar 22, 2023, at 18:18:

> Hi Yu,
> The waiting mentioned here should be the one introduced in [1][2] in 1.15
> to fix the semantic issue of two-phase commit sinks.
> If ensuring that all the data gets committed is not a concern
> for bounded-input jobs, you may re-run the performance tests
> with the final checkpoint disabled:
> execution.checkpointing.checkpoints-after-tasks-finish.enabled: false
> In the following versions we'll also try to trigger a checkpoint immediately
> after the sources finish, to reduce the waiting time.
> Best,
> Yun Gao
> [1] https://issues.apache.org/jira/browse/FLINK-25105
> [2] https://issues.apache.org/jira/browse/FLINK-25105
> ------------------------------------------------------------------
> From:Leonard Xu <xbjt...@gmail.com>
> Send Time:2023 Mar. 22 (Wed.) 18:07
> To:dev <dev@flink.apache.org>
> Subject:Re: [VOTE] Release 1.17.0, release candidate #3
> Hi, Yu Chen
> > The test results show that Flink 1.17 has a significant performance
> > degradation compared to Flink 1.13 (about 8.49%). Should we identify
> > the reason for the performance degradation before Flink 1.17 is released?
> Thanks for the verification. I wonder why you compared against Flink 1.13
> instead of Flink 1.16; the latter is what you need to do in the 1.17 release
> verification if you want to check for a performance regression. Could you
> share the results between 1.16 and 1.17?
> From a technical point of view, this will not block the 1.17 release: 1.13 is
> a pretty old version, so it cannot be confirmed that this is a new issue
> introduced in 1.17. If you can provide the benchmark results of 1.16 and 1.17,
> and confirm that the regression is introduced by 1.17, that would block this
> release.
> Best,
> Leonard
>
