ChaomingZhangCN commented on code in PR #3844:
URL: https://github.com/apache/flink-cdc/pull/3844#discussion_r1908100724
##########
flink-cdc-cli/src/main/java/org/apache/flink/cdc/cli/parser/YamlPipelineDefinitionParser.java:
##########
@@ -97,6 +97,13 @@ public class YamlPipelineDefinitionParser implements PipelineDefinitionParser {
     public static final String TRANSFORM_TABLE_OPTION_KEY = "table-options";
+    private static final String HOST_LIST = "host_list";
+    private static final String COMMA = ",";
+    private static final String HOST_NAME = "hostname";
+    private static final String PORT = "port";
+    private static final String COLON = ":";
+    private static final String MUTIPLE = "_mutiple";
Review Comment:
Should be `_multiple`.
##########
docs/content.zh/docs/connectors/pipeline-connectors/mysql.md:
##########
@@ -77,6 +77,32 @@ pipeline:
   parallelism: 4
 ```
+## Multiple Data Sources Example
+
+Using a single source definition, a pipeline that reads data from multiple MySQL instances and synchronizes it to Doris can be defined as follows:
+
+```yaml
+source:
+  type: mysql_mutiple
Review Comment:
Should we use a new key such as `sources` to describe multiple sources? The `_multiple` suffix in the `type` value seems a bit odd, because the YAML content then no longer corresponds one-to-one with the PipelineDef.
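For illustration only, a minimal sketch of what such a top-level `sources` list could look like. The `sources` key and this layout are just the suggestion above, not an existing Flink CDC option; the host names are placeholders, while the per-source and sink options mirror the existing single-source `mysql` and `doris` connector options:

```yaml
# Hypothetical layout for the suggested `sources` key: a list of source
# definitions instead of a `_multiple` suffix on the `type` value.
sources:
  - type: mysql
    hostname: mysql-host-1
    port: 3306
    username: root
    password: pass
    tables: app_db.\.*
  - type: mysql
    hostname: mysql-host-2
    port: 3306
    username: root
    password: pass
    tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030

pipeline:
  name: Sync Multiple MySQL Databases to Doris
  parallelism: 4
```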
##########
flink-cdc-composer/src/main/java/org/apache/flink/cdc/composer/flink/FlinkPipelineComposer.java:
##########
@@ -126,16 +127,28 @@ private void translate(StreamExecutionEnvironment env, PipelineDef pipelineDef)
         // And required constructors
         OperatorIDGenerator schemaOperatorIDGenerator =
                 new OperatorIDGenerator(schemaOperatorTranslator.getSchemaOperatorUid());
-        DataSource dataSource =
-                sourceTranslator.createDataSource(pipelineDef.getSource(), pipelineDefConfig, env);
+        List<SourceDef> sourceDefs = pipelineDef.getSources();
+        // DataSource dataSource =
+        //         sourceTranslator.createDataSource(sourceDefs, pipelineDefConfig, env);
         DataSink dataSink =
                 sinkTranslator.createDataSink(pipelineDef.getSink(), pipelineDefConfig, env);
-
-        boolean isParallelMetadataSource = dataSource.isParallelMetadataSource();
-
         // O ---> Source
-        DataStream<Event> stream =
-                sourceTranslator.translate(pipelineDef.getSource(), dataSource, env, parallelism);
+        DataStream<Event> stream = null;
+        DataSource dataSource = null;
+        for (SourceDef sourceDef : sourceDefs) {
+            dataSource = sourceTranslator.createDataSource(sourceDef, pipelineDefConfig, env);
+            DataStream<Event> streamBranch =
+                    sourceTranslator.translate(sourceDef, dataSource, env, parallelism);
+            if (stream == null) {
+                stream = streamBranch;
+            } else {
+                stream = stream.union(streamBranch);
+            }
+        }
+        boolean isParallelMetadataSource = dataSource.isParallelMetadataSource();
Review Comment:
I think multiple data sources should always be regarded as parallel metadata sources; reading the flag only from the last `dataSource` created in the loop does not account for the other sources.
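As a hedged sketch of that suggestion (the helper class, its name, and the `createdSources` list are hypothetical; only `DataSource#isParallelMetadataSource()` comes from the code above, assuming the `DataSource` interface from `flink-cdc-common`), the flag could be derived from all sources instead of the last one created in the loop:

```java
import java.util.List;

import org.apache.flink.cdc.common.source.DataSource;

/** Sketch only: derive the parallel-metadata flag from every created source. */
final class ParallelMetadataFlagSketch {

    /**
     * With more than one source, the unioned stream interleaves metadata from
     * several origins, so it is treated as parallel; a single source keeps its
     * own flag.
     */
    static boolean isParallelMetadataSource(List<DataSource> createdSources) {
        return createdSources.size() > 1
                || createdSources.stream().anyMatch(DataSource::isParallelMetadataSource);
    }
}
```

In the loop above, each created `DataSource` would be collected into `createdSources` before the flag is computed, rather than keeping only the last reference.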