haruki-830 opened a new pull request, #4423:
URL: https://github.com/apache/flink-cdc/pull/4423
#### Summary
This commit introduces a configurable partitioning strategy for Flink CDC
pipelines. Currently, the default PRIMARY_KEY hashing distributes events from
the same table across multiple subtasks, which introduces unnecessary overhead
for small tables or tables with no/changing primary keys. With this change,
users can switch to TABLE_ID strategy via YAML configuration to route all
events from the same table to a single subtask without code changes.
#### Key Changes
1. New HashFunctionStrategy Enum
- Introduced HashFunctionStrategy enum in flink-cdc-common with two options:
PRIMARY_KEY (hash by TableId + primary keys for load balancing) and TABLE_ID
(hash by TableId only, routing same-table events to one subtask).
- Designed with @PublicEvolving annotation, allowing future strategies like
ROUND_ROBIN or COLUMNS.
2. New TableIdHashFunctionProvider
- Added TableIdHashFunctionProvider that computes hash based solely on
TableId (namespace, schema, table), ignoring the record payload entirely.
- Uses singleton pattern for HashFunction since it is stateless.
- Suitable for small tables that don't need fine-grained load balancing, or
tables with no/changing primary keys.
3. Pipeline Configuration Option
- Added pipeline option "partitioning.strategy" to allow switching
strategies via YAML, without implementing custom HashFunctionProvider.
- When unset, behavior is identical to current (falls back to sink-provided
provider), ensuring full backward compatibility.
4. Composer Wiring
- Modified FlinkPipelineComposer to resolve HashFunctionProvider based on
the pipeline configuration, applying to RegularPrePartitionOperator,
BatchRegularPrePartitionOperator and DistributedPrePartitionOperator.
5. Comprehensive Testing
- Added TableIdHashFunctionProviderTest with 7 test cases covering
same-table hashing, different-table distinction, namespace sensitivity,
operation type independence, primary key value independence, and singleton
instance verification.
- Added PrePartitionOperatorTest case verifying TABLE_ID strategy routes
events with different primary keys to the same subtask.
- Added FlinkPipelineComposerTest configurations for PRIMARY_KEY and
TABLE_ID strategy validation.
#### Configuration Example
##### Route same-table events to a single subtask
```
pipeline:
name: my-cdc-job
partitioning.strategy: TABLE_ID
```
##### Force load-balanced distribution by primary keys
```
pipeline:
name: my-cdc-job
partitioning.strategy: PRIMARY_KEY
```
#### JIRA Reference
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]