haruki-830 opened a new pull request, #4423:
URL: https://github.com/apache/flink-cdc/pull/4423

   #### Summary
   
   This commit introduces a configurable partitioning strategy for Flink CDC 
pipelines. Currently, the default PRIMARY_KEY hashing distributes events from 
the same table across multiple subtasks, which adds unnecessary overhead for 
small tables and for tables whose primary keys are missing or change over 
time. With this change, users can select the TABLE_ID strategy in the YAML 
pipeline configuration to route all events from the same table to a single 
subtask, with no code changes.
   
   #### Key Changes
   
   1. New HashFunctionStrategy Enum
   - Introduced HashFunctionStrategy enum in flink-cdc-common with two options: 
PRIMARY_KEY (hash by TableId + primary keys for load balancing) and TABLE_ID 
(hash by TableId only, routing same-table events to one subtask).
   - Annotated with @PublicEvolving so that further strategies, such as 
ROUND_ROBIN or COLUMNS, can be added later.
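
As a rough illustration of the enum described above, the following self-contained sketch models the two options and a parser for the raw option value. The wrapper class `StrategySketch`, the `parse` helper, and the case-insensitive matching are assumptions made for this example; they are not taken from the PR's source.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in for the enum in flink-cdc-common; names are illustrative.
class StrategySketch {

    enum HashFunctionStrategy {
        /** Hash by TableId plus primary keys, spreading one table across subtasks. */
        PRIMARY_KEY,
        /** Hash by TableId only, so events of one table reach a single subtask. */
        TABLE_ID
    }

    /**
     * Parses a raw option value. A null or blank value yields Optional.empty(),
     * modelling "option unset". Case-insensitive matching is an assumption.
     */
    static Optional<HashFunctionStrategy> parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(
                HashFunctionStrategy.valueOf(raw.trim().toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(parse("TABLE_ID"));
        System.out.println(parse(null));
    }
}
```

An unknown value makes `valueOf` throw `IllegalArgumentException`, which is one simple way to reject typos in the configuration early.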
   
   2. New TableIdHashFunctionProvider
   - Added TableIdHashFunctionProvider that computes hash based solely on 
TableId (namespace, schema, table), ignoring the record payload entirely.
   - Uses a singleton HashFunction instance, since the function is stateless.
   - Suitable for small tables that don't need fine-grained load balancing, and 
for tables whose primary keys are missing or change over time.
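
The idea of hashing on the table identifier alone can be sketched as follows. `TableId`, `ChangeEvent`, and `HashFunction` here are simplified, hypothetical stand-ins defined inside the example so that it runs without Flink CDC on the classpath; the actual classes in the PR may differ in shape and naming.

```java
import java.util.Objects;

class TableIdHashSketch {

    /** Hypothetical stand-in for a table identifier: namespace, schema, table. */
    static final class TableId {
        final String namespace;
        final String schemaName;
        final String tableName;

        TableId(String namespace, String schemaName, String tableName) {
            this.namespace = namespace;
            this.schemaName = schemaName;
            this.tableName = tableName;
        }
    }

    /** Hypothetical stand-in for a change event with an operation and a key. */
    static final class ChangeEvent {
        final TableId tableId;
        final String operation;
        final Object primaryKey;

        ChangeEvent(TableId tableId, String operation, Object primaryKey) {
            this.tableId = tableId;
            this.operation = operation;
            this.primaryKey = primaryKey;
        }
    }

    /** Assumed shape of a hash function over events. */
    interface HashFunction {
        int hashcode(ChangeEvent event);
    }

    /** Stateless, so one shared instance can serve every table. */
    static final class TableIdHashFunction implements HashFunction {
        static final TableIdHashFunction INSTANCE = new TableIdHashFunction();

        private TableIdHashFunction() {}

        @Override
        public int hashcode(ChangeEvent event) {
            TableId id = event.tableId;
            // Only the identifier contributes; operation and key are ignored.
            return Objects.hash(id.namespace, id.schemaName, id.tableName);
        }
    }

    public static void main(String[] args) {
        TableId orders = new TableId("ns", "shop", "orders");
        HashFunction fn = TableIdHashFunction.INSTANCE;
        int a = fn.hashcode(new ChangeEvent(orders, "INSERT", 1));
        int b = fn.hashcode(new ChangeEvent(orders, "DELETE", 42));
        System.out.println(a == b); // prints true
    }
}
```

Because the record payload never enters the hash, a table without a primary key, or one whose key values change, still maps to one stable hash value.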
   
   3. Pipeline Configuration Option
   - Added the pipeline option "partitioning.strategy" so that strategies can 
be switched via YAML, without implementing a custom HashFunctionProvider.
   - When the option is unset, behavior is unchanged (the composer falls back 
to the sink-provided provider), ensuring full backward compatibility.
   
   4. Composer Wiring
   - Modified FlinkPipelineComposer to resolve the HashFunctionProvider from 
the pipeline configuration. The resolved provider applies to 
RegularPrePartitionOperator, BatchRegularPrePartitionOperator and 
DistributedPrePartitionOperator.
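
The resolution step might look roughly like the sketch below. It assumes that an unset option keeps the sink-provided provider, and that an explicit PRIMARY_KEY or TABLE_ID value overrides it; the second point is inferred from the description, not confirmed by the PR's code. `HashFunctionProvider` is reduced to a hypothetical one-method interface, and `resolve` and `named` are illustrative helpers.

```java
import java.util.Optional;

class ProviderResolutionSketch {

    enum Strategy { PRIMARY_KEY, TABLE_ID }

    /** Hypothetical stand-in; the real provider creates hash functions. */
    interface HashFunctionProvider {
        String name();
    }

    /** Illustrative factory that labels a provider for demonstration. */
    static HashFunctionProvider named(String name) {
        return () -> name;
    }

    /**
     * Picks a provider from the configured strategy.
     * Unset option: keep the sink-provided provider (backward compatible).
     * Explicit option: assumed to override the sink-provided provider.
     */
    static HashFunctionProvider resolve(
            Optional<Strategy> configured, HashFunctionProvider sinkProvided) {
        if (!configured.isPresent()) {
            return sinkProvided;
        }
        switch (configured.get()) {
            case TABLE_ID:
                return named("table-id");
            case PRIMARY_KEY:
                return named("primary-key");
            default:
                throw new IllegalArgumentException(
                        "Unsupported strategy: " + configured.get());
        }
    }

    public static void main(String[] args) {
        HashFunctionProvider sink = named("sink-provided");
        System.out.println(resolve(Optional.empty(), sink).name());
        System.out.println(resolve(Optional.of(Strategy.TABLE_ID), sink).name());
    }
}
```

Resolving the provider once and handing the same instance to each pre-partition operator keeps streaming, batch, and distributed topologies consistent.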
   
   5. Comprehensive Testing
   - Added TableIdHashFunctionProviderTest with 7 test cases covering 
same-table hashing, different-table distinction, namespace sensitivity, 
operation type independence, primary key value independence, and singleton 
instance verification.
   - Added PrePartitionOperatorTest case verifying TABLE_ID strategy routes 
events with different primary keys to the same subtask.
   - Added FlinkPipelineComposerTest configurations for PRIMARY_KEY and 
TABLE_ID strategy validation.
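
The routing property checked by the PrePartitionOperatorTest case can be illustrated with a small standalone sketch. It assumes the target subtask is derived as the hash modulo the downstream parallelism; the helper names and the simplified hash inputs (a table name string and an integer key) are hypothetical.

```java
import java.util.Objects;

class SubtaskRoutingSketch {

    /** Maps a hash to a subtask index; floorMod keeps the result non-negative. */
    static int subtaskFor(int hash, int parallelism) {
        return Math.floorMod(hash, parallelism);
    }

    /** TABLE_ID-style hash: the key is deliberately ignored. */
    static int tableIdHash(String table, Object primaryKey) {
        return Objects.hash(table);
    }

    /** PRIMARY_KEY-style hash: table plus key value. */
    static int primaryKeyHash(String table, Object primaryKey) {
        return Objects.hash(table, primaryKey);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (int pk = 1; pk <= 4; pk++) {
            int byTable = subtaskFor(tableIdHash("db.orders", pk), parallelism);
            int byKey = subtaskFor(primaryKeyHash("db.orders", pk), parallelism);
            System.out.println(
                    "pk=" + pk + " TABLE_ID->" + byTable + " PRIMARY_KEY->" + byKey);
        }
    }
}
```

With the TABLE_ID-style hash every row lands on the same subtask index, while the PRIMARY_KEY-style hash spreads the four keys over different subtasks.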
   
   #### Configuration Example
   
   ##### Route same-table events to a single subtask
   ```
   pipeline:
     name: my-cdc-job
     partitioning.strategy: TABLE_ID
   ```
   
   ##### Force load-balanced distribution by primary keys
   ```
   pipeline:
     name: my-cdc-job
     partitioning.strategy: PRIMARY_KEY
   ```
   
   #### JIRA Reference
   
     

