Ran Tao created FLINK-39824:
-------------------------------
Summary: [mysql-cdc] High CPU usage caused by repeated regex table
filtering in large table synchronization
Key: FLINK-39824
URL: https://issues.apache.org/jira/browse/FLINK-39824
Project: Flink
Issue Type: Bug
Components: Flink CDC
Reporter: Ran Tao
Attachments: 20260602-203545.jpg
*Description*
When the MySQL CDC pipeline source synchronizes a large number of tables,
TaskManager CPU usage can become very high and may stay close to 100%.
CPU profiling shows that most CPU time is spent in Java regex matching during
table filter evaluation:
{code:java}
java.util.regex.Matcher.match
java.util.regex.Matcher.matches
io.debezium.function.Predicates.lambda$matchedByPattern$5
io.debezium.relational.Selectors$TableSelectionPredicateBuilder...
io.debezium.relational.RelationalTableFilters...
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.informAboutUnknownTableIfRequired
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.handleUpdateTableMetadata
{code}
When many tables are synchronized, the same TableId can be checked repeatedly
during binlog event processing. Each check currently runs Debezium's
include/exclude table regex predicates again. If the table list pattern is
large or complex, this repeated regex evaluation may dominate CPU usage.
*Expected Behavior*
The MySQL CDC source should avoid repeatedly evaluating expensive regex table
filters for the same TableId. Once the include/exclude result for a table is
known, subsequent checks for the same table should reuse the result.
*Actual Behavior*
The include/exclude table filter result is recomputed repeatedly through regex
matching, causing high CPU usage in large-scale table synchronization jobs.
*Impact*
This issue affects MySQL CDC jobs that synchronize many tables. It can cause:
- TaskManager CPU usage close to 100%
- Lower binlog processing throughput
- Increased CDC event latency
- Poor scalability for large table-list configurations
*Proposed Fix*
Cache the table filter result by TableId in MySqlSourceConfig.
The cached filter should preserve the existing behavior. A table is captured
only if it is:
- included by the Debezium table filter, AND
- not matched by exclude-table-list, if configured
This avoids repeated regex matching for the same table while keeping the
original include/exclude semantics unchanged.
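
The idea can be sketched as a memoizing predicate. This is only an
illustration under stated assumptions, not the actual patch: the class
CachingTableFilter and the build helper are hypothetical names, plain strings
stand in for Debezium's TableId, and the cache is an unbounded
ConcurrentHashMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical wrapper that remembers the filter decision for each key. */
final class CachingTableFilter<K> implements Predicate<K> {
    private final Predicate<K> delegate;
    private final Map<K, Boolean> cache = new ConcurrentHashMap<>();

    CachingTableFilter(Predicate<K> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean test(K tableId) {
        // The expensive delegate runs only on the first lookup of a key;
        // later lookups for the same key are served from the map.
        return cache.computeIfAbsent(tableId, delegate::test);
    }

    /**
     * Hypothetical helper: "matched by include AND not matched by exclude".
     * Strings stand in for table identifiers; exclude may be null.
     */
    static Predicate<String> build(Pattern include, Pattern exclude) {
        Predicate<String> included = id -> include.matcher(id).matches();
        Predicate<String> notExcluded = exclude == null
                ? id -> true
                : id -> !exclude.matcher(id).matches();
        return new CachingTableFilter<>(included.and(notExcluded));
    }
}
```

With such a wrapper, the regex cost is paid once per distinct table rather
than once per check. An unbounded map is assumed to be acceptable here because
the number of distinct tables is finite. Whether cached entries would ever
need invalidation is not covered by this sketch.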