yuxiqian commented on PR #3818: URL: https://github.com/apache/flink-cdc/pull/3818#issuecomment-2563220029
Hi @lvyanquan, I have some concerns about how BucketAssignOperator works with the schema evolution strategy.

1) FlushEvent now works as a "pipeline drain indicator": once all data sink writers have acknowledged it, there shouldn't be any unhandled / uncommitted events left flowing through the pipeline, so SchemaRegistry can evolve the downstream DB safely. However, the bucket assigning strategy might break that assumption: data sink writers might receive data change events with a stale schema, even after the external schema evolution process has finished.

2) Potential schema operator hanging risk with a "distributed" topology. This is basically the same as described in #3680. In short, if a broadcast / custom partitioning topology is applied, then blocking one upstream partition will implicitly block all downstream partitions from handling events.

3) Why do we need another bucket assigner operator? AFAIK all data change events have already been hashed and distributed in PartitionOperator. Since adding a BucketAssignOperator does not change the parallelism, is there any reason we can't calculate the correct bucket partition ID in advance, instead of creating another partitioning topology?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
