taoran92 opened a new pull request, #4422:
URL: https://github.com/apache/flink-cdc/pull/4422

   # What is the purpose of the change
   
     This PR reduces high CPU usage in the MySQL CDC source when synchronizing a large number of tables.
   
     When many tables are captured, MySQL binlog event processing may repeatedly check whether the same TableId is included by the configured table filters. The hot path goes through Debezium's table filter predicates, which rely on regex matching:
   
   java.util.regex.Matcher.match
   java.util.regex.Matcher.matches
   io.debezium.relational.RelationalTableFilters
   io.debezium.connector.mysql.MySqlStreamingChangeEventSource.informAboutUnknownTableIfRequired
   
     When the table list is large or the regex patterns are complex, repeatedly evaluating the filter for the same table can consume significant CPU and keep TaskManager CPU usage close to 100%.
   
     This PR caches the table filter result by TableId after constructing the 
Debezium table filter. The cached filter preserves the existing semantics of 
the Debezium include filter and Flink CDC excludeTableList,
     while avoiding repeated regex evaluation for the same table.
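
The caching idea can be sketched as follows. This is a minimal illustration, not the code added in this PR: the class name `CachingTableFilter`, the use of plain `String` table ids, and the example patterns are all assumptions made to keep the snippet self-contained without Debezium on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a cached table filter; not the class from this PR. */
final class CachingTableFilter implements Predicate<String> {
    // The original (regex-based) filter whose result is being cached.
    private final Predicate<String> delegate;
    // One cached decision per table id; safe for concurrent readers.
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    CachingTableFilter(Predicate<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean test(String tableId) {
        // The delegate runs only the first time a table id is seen.
        return cache.computeIfAbsent(tableId, delegate::test);
    }

    public static void main(String[] args) {
        // Assumed example patterns: an include filter and an exclude list.
        Pattern include = Pattern.compile("app_db\\.orders_\\d+");
        Pattern exclude = Pattern.compile("app_db\\.orders_0");

        // Combine include and exclude first, then cache the combined result,
        // so the cached filter answers exactly as the uncached one would.
        Predicate<String> uncached =
                id -> include.matcher(id).matches() && !exclude.matcher(id).matches();
        CachingTableFilter filter = new CachingTableFilter(uncached);

        for (String id : new String[] {"app_db.orders_1", "app_db.orders_0", "app_db.users"}) {
            System.out.println(id + " included: " + filter.test(id));
        }
    }
}
```

Caching the combined include/exclude decision, rather than each regex separately, is what keeps the semantics identical in this sketch. The cache here is unbounded, which is only reasonable when the set of table ids is finite and fairly stable.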
   
     # Brief change log
   
     - Cache MySQL CDC table filter results by TableId in MySqlSourceConfig
     - Preserve existing include/exclude table filter semantics when using the cached filter
     - Add unit tests to verify that repeated checks for the same table reuse the cached result
     - Add unit tests to verify that excludeTableList behavior is unchanged
   
     # Verifying this change
   
     This change is verified by unit tests:
   
     - MySqlSourceConfigTest#testCachesTableFilterResults
     - MySqlSourceConfigTest#testTableFilterWithExcludeTableList
   
     Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e. is any changed class annotated with @Public(@PublicEvolving): no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): yes
     - Anything that affects deployment or recovery: no
   
     # Documentation
   
     Does this pull request introduce a new feature? no
   
     If yes, how is the feature documented? not applicable


