[
https://issues.apache.org/jira/browse/FLINK-35291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
LiuZeshan updated FLINK-35291:
------------------------------
Description:
We ran performance tests on Flink CDC 3.0 and found, using the Arthas
profiler, a significant performance bottleneck in the deserialization of row
data. The main cost is the String.format call in the
BinaryRecordDataGenerator class, so we made a simple performance
optimization.
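To illustrate why a String.format call on a per-record path can be costly, here is a minimal sketch. It assumes the format call only builds an error message for a check that normally passes. The class LazyMessageCheck and its methods are hypothetical and are not the actual Flink CDC code.

```java
import java.util.function.Supplier;

final class LazyMessageCheck {
    // Counts how often a message was actually formatted (demo instrumentation only).
    static int formatCalls = 0;

    // Hypothetical error message builder; String.format parses the pattern on every call.
    static String describe(int types, int values) {
        formatCalls++;
        return String.format(
                "Types and values must have the same length, but got %d types and %d values.",
                types, values);
    }

    // Eager variant: the caller has already paid for formatting, even if the check passes.
    static void checkEager(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Lazy variant: the message is only built when the check fails.
    static void checkLazy(boolean condition, Supplier<String> message) {
        if (!condition) {
            throw new IllegalArgumentException(message.get());
        }
    }

    public static void main(String[] args) {
        int records = 100_000;

        formatCalls = 0;
        for (int i = 0; i < records; i++) {
            checkEager(true, describe(3, 3));
        }
        System.out.println("eager format calls: " + formatCalls);

        formatCalls = 0;
        for (int i = 0; i < records; i++) {
            checkLazy(true, () -> describe(3, 3));
        }
        System.out.println("lazy format calls: " + formatCalls);
    }
}
```

In the eager loop the message is formatted once per record, while in the lazy loop it is never formatted because the check always passes. Whether this matches the change in the pull request is an assumption; the sketch only shows the general technique.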
test environment:
* flink: 1.20-SNAPSHOT master
* flink-cdc: 3.2-SNAPSHOT master
* 1CU minicluster mode
{code:yaml}
source:
  type: mysql
  hostname: localhost
  port: 3308
  username: root
  password: 123456
  tables: test.user_behavior
  server-id: 5400-5404
  #server-time-zone: UTC
  scan.startup.mode: earliest-offset
  debezium.poll.interval.ms: 10

sink:
  type: values
  name: Values Sink
  materialized.in.memory: false
  print.enabled: false

pipeline:
  name: Sync MySQL Database to Values
  parallelism: 1
{code}
*Before optimization: 3.5w/s (35,000 per second)*
!https://bytedance.larkoffice.com/space/api/box/stream/download/asynccode/?code=MTRjZGIyNWYyYmVlY2YwNDNmYjExZDE4MjRhMGYyYzlfcVRuM0JBYXpTem9qUWRxdkY0NGZmVkpWc1cxMnlzaE9fVG9rZW46RklTbWJUNkVYb2s0WGF4eEttWWN6M0hIbjJTXzE3MTQ5MjU4OTY6MTcxNDkyOTQ5Nl9WNA|width=361,height=179!
[^cdc-3.0-1c.html]
^The flame graph shows that approximately 24.45% of the time is spent in String.format.^
!image-2024-05-06-00-29-34-618.png|width=583,height=171!
*After optimization: 5w/s (50,000 per second)*
!https://bytedance.larkoffice.com/space/api/box/stream/download/asynccode/?code=YjRkMDRmYTkzNzRiNjBmMzVmN2VlYTYyMGRmMGU0ZDRfcFIyNGNGMEViSzRjektpdVFWYTYyUnJQbWJjd1lnb3dfVG9rZW46V2ZXVGJ2T3lDb3dCSmF4WVZvTGMzc2h2bmpmXzE3MTQ5MjU5NTM6MTcxNDkyOTU1M19WNA|width=363,height=174!
[^cdc-3.0-1c-2.html]
After optimization, 4.7% of the time (extractBeforeDataRecord +
extractAfterDataRecord) is still spent in
org/apache/flink/cdc/runtime/typeutils/BinaryRecordDataGenerator.<init>.
Perhaps this can be optimized further.
!image-2024-05-06-00-37-16-028.png|width=379,height=107!
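One possible direction for the remaining constructor cost is to reuse a generator per schema instead of constructing one per record. The sketch below assumes the generator depends only on the column types and can safely be reused; GeneratorCacheSketch and RowGenerator are hypothetical stand-ins, not the Flink CDC classes, and thread safety and schema changes are not handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GeneratorCacheSketch {

    /** Hypothetical stand-in for a generator whose constructor does per-schema setup. */
    static final class RowGenerator {
        // Counts constructor invocations (demo instrumentation only).
        static int constructed = 0;

        private final List<String> columnTypes;

        RowGenerator(List<String> columnTypes) {
            constructed++;
            this.columnTypes = List.copyOf(columnTypes);
        }

        // Renders a row as "TYPE=value" pairs; a placeholder for real binary encoding.
        String generate(Object[] values) {
            if (values.length != columnTypes.size()) {
                throw new IllegalArgumentException(
                        "Expected " + columnTypes.size() + " values but got " + values.length);
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < values.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(columnTypes.get(i)).append('=').append(values[i]);
            }
            return sb.toString();
        }
    }

    // One generator per distinct schema; keys must not be mutated after insertion.
    private final Map<List<String>, RowGenerator> cache = new HashMap<>();

    RowGenerator generatorFor(List<String> columnTypes) {
        return cache.computeIfAbsent(columnTypes, RowGenerator::new);
    }

    public static void main(String[] args) {
        GeneratorCacheSketch sketch = new GeneratorCacheSketch();
        List<String> schema = List.of("INT", "STRING");
        for (int i = 0; i < 3; i++) {
            System.out.println(sketch.generatorFor(schema).generate(new Object[] {i, "row" + i}));
        }
        System.out.println("generators constructed: " + RowGenerator.constructed);
    }
}
```

The three records in the demo share one generator instance, so the constructor runs once rather than once per record.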
was:
We ran performance tests on Flink CDC 3.0 and found, using the Arthas
profiler, a significant performance bottleneck in the serialization of row
data. The main cost is the String.format call in the
BinaryRecordDataGenerator class, so we made a simple performance
optimization.
test environment:
* flink: 1.20-SNAPSHOT master
* flink-cdc: 3.2-SNAPSHOT master
* 1CU minicluster mode
{code:yaml}
source:
  type: mysql
  hostname: localhost
  port: 3308
  username: root
  password: 123456
  tables: test.user_behavior
  server-id: 5400-5404
  #server-time-zone: UTC
  scan.startup.mode: earliest-offset
  debezium.poll.interval.ms: 10

sink:
  type: values
  name: Values Sink
  materialized.in.memory: false
  print.enabled: false

pipeline:
  name: Sync MySQL Database to Values
  parallelism: 1
{code}
*Before optimization: 3.5w/s (35,000 per second)*
!https://bytedance.larkoffice.com/space/api/box/stream/download/asynccode/?code=MTRjZGIyNWYyYmVlY2YwNDNmYjExZDE4MjRhMGYyYzlfcVRuM0JBYXpTem9qUWRxdkY0NGZmVkpWc1cxMnlzaE9fVG9rZW46RklTbWJUNkVYb2s0WGF4eEttWWN6M0hIbjJTXzE3MTQ5MjU4OTY6MTcxNDkyOTQ5Nl9WNA|width=361,height=179!
[^cdc-3.0-1c.html]
^The flame graph shows that approximately 24.45% of the time is spent in String.format.^
!image-2024-05-06-00-29-34-618.png|width=583,height=171!
*After optimization: 5w/s (50,000 per second)*
!https://bytedance.larkoffice.com/space/api/box/stream/download/asynccode/?code=YjRkMDRmYTkzNzRiNjBmMzVmN2VlYTYyMGRmMGU0ZDRfcFIyNGNGMEViSzRjektpdVFWYTYyUnJQbWJjd1lnb3dfVG9rZW46V2ZXVGJ2T3lDb3dCSmF4WVZvTGMzc2h2bmpmXzE3MTQ5MjU5NTM6MTcxNDkyOTU1M19WNA|width=363,height=174!
[^cdc-3.0-1c-2.html]
After optimization, 4.7% of the time (extractBeforeDataRecord +
extractAfterDataRecord) is still spent in
org/apache/flink/cdc/runtime/typeutils/BinaryRecordDataGenerator.<init>.
Perhaps this can be optimized further.
!image-2024-05-06-00-37-16-028.png|width=379,height=107!
> Improve the ROW data deserialization performance of
> DebeziumEventDeserializationScheme
> --------------------------------------------------------------------------------------
>
> Key: FLINK-35291
> URL: https://issues.apache.org/jira/browse/FLINK-35291
> Project: Flink
> Issue Type: Improvement
> Components: Flink CDC
> Affects Versions: 1.20.0
> Reporter: LiuZeshan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.20.0
>
> Attachments: cdc-3.0-1c-2.html, cdc-3.0-1c.html,
> image-2024-05-06-00-29-34-618.png, image-2024-05-06-00-37-16-028.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)