ashulin opened a new issue, #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608

   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   In the current implementation of SeaTunnel, connectors are coupled to the 
computing engines, so implementing a connector requires implementing the 
interfaces of each computing engine.
   
   The detailed design doc: 
https://docs.google.com/document/d/1h-BeGKK98VarSpYUfYkNtpYLVbYCh8PM_09N1pdq_aw/edit?usp=sharing
   
   ### Motivation
   1. A connector only needs to be implemented once and can be used on all 
engines;
   2. Supports multiple versions of the Spark/Flink engines;
   3. The Source interface makes partition/shard/split/parallel logic explicit;
   4. Multiplexes JDBC/log connections.
   
   **Why not use Apache Beam?**
   Apache Beam divides sources into two categories, Unbounded and Bounded, so a 
connector cannot be written once and reused for both batch and streaming;
   
   ### Overall Design
   
   ![SeaTunnel 
Framework](https://user-images.githubusercontent.com/36807946/162152969-c103a9a1-affe-4b15-94db-57be2678515a.png)
   
   -   **Catalog**: Metadata management, which can automatically discover the 
schema and other information of structured databases;
   -   **Catalog Storage**: Used to store metadata for unstructured storage 
engines (e.g. Kafka);
   
   -   **SQL**:
   
   -   **DataType**: Table column data types;
   
   -   **Table API**: Used for context passing and SeaTunnel Source/Sink 
instantiation;
   
   -   **Source API**:
        - Explicit partition/shard/split/parallel logic; 
        - Batch & Streaming Unification;
        - Multiplexing of source connections;
   
   -   **Sink API**:
        - Distributed transaction; 
        - Aggregated commits;
   
   -   **Translation**:
        - Make the engine support the SeaTunnel connector. 
        - Convert data to Row inside the engine.
        - Data distribution after multiplexing.
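
   The Source API ideas above (explicit splits, batch & streaming unification) 
could be sketched roughly as follows. All names here (`SeaTunnelSource`, 
`SplitEnumerator`, `RangeSplit`, etc.) are illustrative assumptions based on 
this design, not the final interfaces:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are assumptions, not the final API.
enum Boundedness { BOUNDED, UNBOUNDED } // batch & streaming share one code path

interface SourceSplit extends Serializable {
    String splitId(); // explicit partition/shard/split identity
}

interface SplitEnumerator<S extends SourceSplit> {
    List<S> discoverSplits(); // once for batch, repeatedly for streaming
}

interface SeaTunnelSource<S extends SourceSplit> {
    Boundedness getBoundedness();
    SplitEnumerator<S> createEnumerator(); // parallelism is made explicit here
}

// Example: split a numeric primary-key range into roughly equal chunks,
// so the parallel read logic lives in the connector, not in each engine.
class RangeSplit implements SourceSplit {
    final long start, end;
    RangeSplit(long start, long end) { this.start = start; this.end = end; }
    public String splitId() { return start + "-" + end; }
}

class RangeEnumerator implements SplitEnumerator<RangeSplit> {
    private final long min, max;
    private final int parallelism;
    RangeEnumerator(long min, long max, int parallelism) {
        this.min = min; this.max = max; this.parallelism = parallelism;
    }
    public List<RangeSplit> discoverSplits() {
        List<RangeSplit> splits = new ArrayList<>();
        long step = Math.max(1, (max - min + parallelism) / parallelism);
        for (long s = min; s <= max; s += step) {
            splits.add(new RangeSplit(s, Math.min(s + step - 1, max)));
        }
        return splits;
    }
}
```

   The same `RangeEnumerator` would then run unchanged whether the translation 
layer hands its splits to Spark or to Flink.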
   
   ### Simple Flow
   
   ![SeaTunnel 
Flow](https://user-images.githubusercontent.com/36807946/162159155-038b1bea-6d2c-4345-8076-f393c76ba168.png)
   
   **Why do we need multiplexed connections?**
   Streaming scenarios:
   - An RDB (e.g. MySQL) may hit "too many connections" errors or put excessive pressure on the database;
   - Change data capture (CDC) readers parse the same logs repeatedly (e.g. MySQL 
binlog, Oracle redo log);
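
   One way multiplexing could work is a reference-counted registry that hands 
every reader on a node the same physical connection per URL. This is a minimal 
sketch under that assumption; `SharedConnection` and `ConnectionRegistry` are 
hypothetical names, not part of the design doc:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: readers sharing one physical connection per URL
// instead of each opening their own, which avoids "too many connections"
// errors and duplicate binlog/redo-log parsing in CDC scenarios.
class SharedConnection {
    final String url;
    final AtomicInteger refCount = new AtomicInteger();
    SharedConnection(String url) { this.url = url; }
}

class ConnectionRegistry {
    private static final Map<String, SharedConnection> POOL = new ConcurrentHashMap<>();

    // Returns the single shared connection for this URL, creating it on first use.
    static SharedConnection acquire(String url) {
        SharedConnection c = POOL.computeIfAbsent(url, SharedConnection::new);
        c.refCount.incrementAndGet();
        return c;
    }

    // Physically close only when the last reader releases it.
    static void release(SharedConnection c) {
        if (c.refCount.decrementAndGet() == 0) {
            POOL.remove(c.url);
            // a real implementation would close the underlying socket here
        }
    }
}
```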
   
   ### Simple Source & Sink Flow
   ![SeaTunnel Engine 
Flow](https://user-images.githubusercontent.com/36807946/163544249-2a0f99f8-55e5-42c5-9d12-cc4cdc9d541e.png)
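
   The Sink API's distributed transaction and aggregated commits could look 
roughly like a two-phase protocol: each parallel writer stages data and returns 
a commit handle, and one committer commits all handles together. A minimal 
sketch, with all names (`SinkWriter`, `AggregatedCommitter`, `CommitInfo`) as 
illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of aggregated commits; names are assumptions.
class CommitInfo {
    final String transactionId;
    CommitInfo(String transactionId) { this.transactionId = transactionId; }
}

interface SinkWriter<T> {
    void write(T element);
    CommitInfo prepareCommit(); // phase 1: stage data, return a handle
}

interface AggregatedCommitter {
    boolean commit(List<CommitInfo> handles); // phase 2: one global commit
}

// Minimal in-memory example: a writer that buffers rows, and a committer
// that only succeeds when every writer produced a valid handle.
class BufferingWriter implements SinkWriter<String> {
    private final List<String> buffer = new ArrayList<>();
    private final String txId;
    BufferingWriter(String txId) { this.txId = txId; }
    public void write(String element) { buffer.add(element); }
    public CommitInfo prepareCommit() { return new CommitInfo(txId); }
}

class SimpleAggregatedCommitter implements AggregatedCommitter {
    public boolean commit(List<CommitInfo> handles) {
        return !handles.isEmpty()
                && handles.stream().allMatch(h -> h.transactionId != null);
    }
}
```

   Aggregating the commit into a single call keeps the all-or-nothing decision 
in one place, rather than letting each parallel writer commit independently.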
   
   ### The subtasks: 
   - [x] #1701
   - [x] #1704
   - [x] #1734
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

