Ran Tao created FLINK-39824:
-------------------------------

             Summary: [mysql-cdc] High CPU usage caused by repeated regex table 
filtering in large table synchronization
                 Key: FLINK-39824
                 URL: https://issues.apache.org/jira/browse/FLINK-39824
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
            Reporter: Ran Tao
         Attachments: 20260602-203545.jpg

*Description*

When using the MySQL CDC pipeline source to synchronize a large number of
tables, TaskManager CPU usage can become very high and may stay close to 100%.

CPU profiling shows that most CPU time is spent in Java regex matching during 
table filter evaluation:
{code:java}
java.util.regex.Matcher.match
java.util.regex.Matcher.matches
io.debezium.function.Predicates.lambda$matchedByPattern$5
io.debezium.relational.Selectors$TableSelectionPredicateBuilder...
io.debezium.relational.RelationalTableFilters...
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.informAboutUnknownTableIfRequired
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.handleUpdateTableMetadata
{code}
In jobs that synchronize many tables, the same TableId can be checked
repeatedly during binlog event processing. Each check currently re-evaluates
Debezium's include/exclude table regex predicates. If the table list pattern
is large or complex, this repeated regex evaluation may dominate CPU usage.

*Expected Behavior*

The MySQL CDC source should avoid repeatedly evaluating expensive regex table
filters for the same TableId. Once the include/exclude result for a table is
known, subsequent checks for the same table should reuse the result.

*Actual Behavior*

The include/exclude table filter result is recomputed repeatedly through regex 
matching, causing high CPU usage in large-scale table synchronization jobs.

*Impact*

This issue affects MySQL CDC jobs that synchronize many tables. It can cause:
 - TaskManager CPU usage close to 100%
 - Lower binlog processing throughput
 - Increased CDC event latency
 - Poor scalability for large table-list configurations

*Proposed Fix*

Cache the table filter result by TableId in MySqlSourceConfig.

The cached filter should preserve the existing behavior:

(included by the Debezium table filter)
AND
(not matched by exclude-table-list, if configured)

This avoids repeated regex matching for the same table while keeping the 
original include/exclude semantics unchanged.
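
The idea can be illustrated with a minimal, standalone sketch. It is not the actual patch: it uses plain String table names instead of Debezium's TableId, and the names CachedTableFilterSketch, cached and regexFilter are hypothetical. The counter exists only to make the number of regex evaluations visible. The cache here is unbounded and assumes the set of distinct tables is finite and that keys are non-null with stable equals/hashCode.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: memoize an expensive table filter per table identifier.
final class CachedTableFilterSketch {

    // Wraps a predicate so each distinct key is evaluated by the delegate at most once.
    static <T> Predicate<T> cached(Predicate<T> delegate) {
        Map<T, Boolean> cache = new ConcurrentHashMap<>();
        return key -> cache.computeIfAbsent(key, delegate::test);
    }

    // Stand-in for the regex-based filter: included AND not excluded (if an exclude regex is set).
    // The counter records how often the regex path actually runs.
    static Predicate<String> regexFilter(
            String includeRegex, String excludeRegex, AtomicInteger evaluations) {
        Pattern include = Pattern.compile(includeRegex);
        Pattern exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
        return tableId -> {
            evaluations.incrementAndGet();
            return include.matcher(tableId).matches()
                    && (exclude == null || !exclude.matcher(tableId).matches());
        };
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        Predicate<String> filter =
                cached(regexFilter("app_db\\.order_.*", "app_db\\.order_tmp", evaluations));

        // Simulate many binlog events that refer to the same table.
        for (int i = 0; i < 1000; i++) {
            filter.test("app_db.order_001");
        }
        System.out.println("included: " + filter.test("app_db.order_001"));
        System.out.println("regex evaluations: " + evaluations.get());
    }
}
```

Because the wrapper only stores the boolean computed by the delegate, the include/exclude semantics stay exactly those of the wrapped filter. If table filters can change at runtime, the cache would need to be rebuilt together with the filter.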



