yuxiqian opened a new pull request, #3801: URL: https://github.com/apache/flink-cdc/pull/3801
This closes FLINK-36763 and FLINK-36690.

As explained in #3680, the current pipeline design doesn't cooperate well with tables whose data and schema change events are spread across different partitions, i.e. [distributed tables](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute). Unfortunately, some data sources (like Kafka) are naturally distributed in this way and cannot easily be integrated into the current pipeline framework. To resolve this issue while keeping backwards compatibility, the following changes have been made:

1. Added another suite of `SchemaOperator` and `SchemaCoordinator` for the distributed topology (see the composer sketch after this list).
   * The previous operators remain in the `schema.regular` package, while the new ones live in the `schema.distributed` package.
   * Common code has been pulled up into an abstract base class `SchemaRegistry` to reduce duplication.
2. Added a new `@Experimental` optional method to `DataSource` to switch between the two topologies:

   ```java
   @PublicEvolving
   public interface DataSource {
       // ...
       @Experimental
       default boolean canContainDistributedTables() {
           return false;
       }
   }
   ```

   The composer detects the data source's distribution trait to determine which operator topology to generate.
3. Extracted schema merging utilities into `SchemaMergingUtils` and deprecated the corresponding functions in `SchemaUtils`. Schema merging is now required in the Transform, Routing, and Schema evolution stages, and sources that support schema inference may need it as well, so unifying it in one place makes it easier to maintain (see the merging sketch at the end).
4. Updated migration test cases to cover CDC 3.2.0+ only. CDC 3.1.1 was released over 6 months ago, and keeping state compatibility with earlier versions is not really worthwhile.
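For illustration, here is a minimal, self-contained sketch of how the composer-side dispatch could look. The `DataSource` interface below only mirrors the snippet above, and `MyDistributedSource` / `buildTopology` are hypothetical names used for this example rather than parts of the actual composer API.

```java
// Minimal local mirror of the DataSource snippet shown above; MyDistributedSource
// and buildTopology are hypothetical names used only for this illustration.
interface DataSource {
    default boolean canContainDistributedTables() {
        return false;
    }
}

// A source whose data and schema change events may be scattered across
// partitions (e.g. Kafka) opts into the distributed topology by overriding the trait.
class MyDistributedSource implements DataSource {
    @Override
    public boolean canContainDistributedTables() {
        return true;
    }
}

class TopologySketch {
    // The composer inspects the trait and picks the matching operator set:
    // schema.distributed.* for distributed sources, schema.regular.* otherwise.
    static String buildTopology(DataSource source) {
        return source.canContainDistributedTables()
                ? "schema.distributed.SchemaOperator + SchemaCoordinator"
                : "schema.regular.SchemaOperator + SchemaCoordinator";
    }

    public static void main(String[] args) {
        System.out.println(buildTopology(new MyDistributedSource()));
        System.out.println(buildTopology(new DataSource() {}));
    }
}
```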

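The schema merging mentioned in point 3 can be pictured as computing a "least common" schema for two versions of the same table: take the union of their columns and widen conflicting column types. The sketch below is a simplified stand-in under assumed semantics (columns as a name-to-type map, only INT/BIGINT widening modeled), not the actual `SchemaMergingUtils` API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for schema merging; the real SchemaMergingUtils covers the
// full CDC type system, this only illustrates the least-common-schema idea.
class SchemaMergingSketch {

    // Union of columns from both schema versions, widening types on conflict.
    static Map<String, String> merge(Map<String, String> left, Map<String, String> right) {
        Map<String, String> merged = new LinkedHashMap<>(left);
        right.forEach((name, type) ->
                merged.merge(name, type, SchemaMergingSketch::widerType));
        return merged;
    }

    // Pick the wider of two column types; INT merged with BIGINT yields BIGINT.
    static String widerType(String a, String b) {
        if (a.equals(b)) {
            return a;
        }
        if (("INT".equals(a) && "BIGINT".equals(b)) || ("BIGINT".equals(a) && "INT".equals(b))) {
            return "BIGINT";
        }
        throw new IllegalStateException("Incompatible column types: " + a + " vs " + b);
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new LinkedHashMap<>();
        v1.put("id", "INT");
        v1.put("name", "VARCHAR");

        Map<String, String> v2 = new LinkedHashMap<>();
        v2.put("id", "BIGINT");
        v2.put("age", "INT");

        // Prints {id=BIGINT, name=VARCHAR, age=INT}
        System.out.println(merge(v1, v2));
    }
}
```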