Hi gaojun,
I think this is a good idea. I have a question: what is the difference between the api-draft and engine branches?

Best,
Leo65535

At 2022-05-27 18:06:29, "JUN GAO" <[email protected]> wrote:

>Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>
> - *Better resource utilization rate*
>
>Real-time data synchronization is an important user scenario. Sometimes we
>need real-time synchronization of a full database. A common practice among
>data synchronization engines today is one job per table. The advantage of
>this practice is that the failure of one job does not affect the others.
>But it wastes resources when most of the tables hold only a small amount
>of data.
>
>We hope the SeaTunnel Engine can solve this problem. We plan to support a
>more flexible resource-sharing strategy that allows jobs submitted by the
>same user to share resources. Users can even specify which jobs share
>resources with each other. If anyone has an idea, you are welcome to
>discuss it on the mailing list or in a GitHub issue.
>
> - *Fewer database connectors*
>
>Another common problem in full-database synchronization with CDC is that
>each table needs its own database connection. This puts a lot of pressure
>on the database server when the database contains many tables.
>
>Can we design database connections as a resource shared between jobs?
>Users could configure a database connection pool. When a job uses the
>pool, the SeaTunnel Engine will initialize it on the node where the
>source/sink connector runs and then pass the pool into the source/sink
>connector. Combined with the better resource utilization rate described
>above
><https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
>we can reduce the number of database connections to an acceptable range.
>
>Another way to reduce the database connections used by the CDC Source
>Connector is to support reading multiple tables in a single CDC Source
>Connector; the stream is then split by table name inside the SeaTunnel
>Engine.
>
>This reduces the connections used by the CDC Source Connector, but it
>cannot reduce the connections used by the sink when the synchronization
>target is also a database. A shared database connection pool is a good way
>to solve both sides.
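
A minimal sketch of the shared-pool idea above. The registry below is
hypothetical (not existing SeaTunnel API), and HikariCP is used purely for
illustration:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical per-node registry that lets the source/sink connectors
    // of several jobs share one JDBC pool instead of opening their own
    // connections per table.
    public final class ConnectorPoolRegistry {

        private static final Map<String, HikariDataSource> POOLS =
                new ConcurrentHashMap<>();

        private ConnectorPoolRegistry() {}

        // Jobs that configure the same pool name share the same underlying
        // pool, so N tables no longer imply N independent connection sets.
        public static HikariDataSource getOrCreate(String poolName,
                                                   String jdbcUrl,
                                                   String user,
                                                   String password,
                                                   int maxSize) {
            return POOLS.computeIfAbsent(poolName, name -> {
                HikariConfig config = new HikariConfig();
                config.setPoolName(name);
                config.setJdbcUrl(jdbcUrl);
                config.setUsername(user);
                config.setPassword(password);
                config.setMaximumPoolSize(maxSize); // the bound the DB server sees
                return new HikariDataSource(config);
            });
        }

        public static Connection borrow(String poolName) throws SQLException {
            HikariDataSource pool = POOLS.get(poolName);
            if (pool == null) {
                throw new IllegalStateException("Pool not initialized: " + poolName);
            }
            return pool.getConnection(); // handed back to the pool on close()
        }
    }

With something like this, the total connection count is bounded by maxSize
per pool rather than by the number of tables.
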
>
> - *Data Cache between Source and Sink*
>
>Flume is an excellent data synchronization project. The Flume Channel can
>cache data when the sink fails and cannot write it. This is useful in some
>scenarios. For example, some users have a limited window in which to save
>their database logs, so the CDC Source Connector must be able to keep
>reading those logs even if the sink cannot write data.
>
>A feasible workaround today is to start two jobs. One job uses the CDC
>Source Connector to read database logs and the Kafka Sink Connector to
>write them to Kafka; another job uses the Kafka Source Connector to read
>from Kafka and the target Sink Connector to write to the target. This
>solution requires the user to have a deep understanding of low-level
>technology, and two jobs increase the difficulty of operation and
>maintenance. Because every job needs a JobMaster, it also consumes more
>resources.
>
>Ideally, users only need to know that data is read from the source and
>written to the sink, and that along the way the data can be cached in case
>the sink fails. The synchronization engine needs to add the cache
>operation to the execution plan automatically and ensure the source keeps
>working even if the sink fails. In this process, the engine must make the
>writes to and reads from the cache transactional, which guarantees data
>consistency. (See the interface sketch below the quoted mail.)
>
>The execution plan looks like this: [diagram not included in the
>plain-text mail]
>
> - *Schema Evolution*
>
>Schema evolution is a feature that allows users to easily change a table's
>current schema to accommodate data that changes over time. Most commonly,
>it is used when performing an append or overwrite operation, to
>automatically adapt the schema to include one or more new columns.
>
>This feature is required in real-time data warehouse scenarios. Currently,
>the Flink and Spark engines do not support it.
>
> - *Finer fault tolerance*
>
>At present, most real-time processing engines fail the whole job when one
>of its tasks fails. The main reason is that downstream operators depend on
>the results of upstream operators. In data synchronization, however, data
>is simply read from the source and written to the sink; no intermediate
>result state needs to be saved, so the failure of one task does not affect
>the correctness of the results of other tasks.
>
>The new engine should provide more sophisticated fault-tolerance
>management. It should support the failure of a single task without
>affecting the execution of other tasks, and it should provide an interface
>so that users can manually retry failed tasks instead of retrying the
>entire job.
>
> - *Speed Control*
>
>In batch jobs, we need to support speed control, letting users choose the
>synchronization speed they want so as to limit the impact on the source
>and target databases. (A throttling sketch appears at the end of this
>mail.)
>
>*More Information*
>
>I made a simple design for the SeaTunnel Engine. You can learn more
>details in the following document.
>
>https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>
>--
>
>Best Regards
>
>------------
>
>Apache DolphinScheduler PMC
>
>Jun Gao
>[email protected]
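
On the Data Cache point above, here is a rough interface sketch of what an
engine-inserted transactional cache stage between source and sink could
look like. Every name here is hypothetical, not existing SeaTunnel API:

    import java.util.List;

    // Hypothetical cache stage the engine would splice into the plan:
    //   Source -> CacheChannel -> Sink
    // Writes and reads are transactional so that data is neither lost nor
    // duplicated while the sink is down.
    public interface CacheChannel<T> {

        // Source side: buffer records inside a cache transaction.
        String beginWriteTransaction();

        void write(String txnId, T record);

        // Commit atomically on checkpoint; abort rolls the buffer back, so
        // a source failure never leaves half-written batches visible.
        void commitWrite(String txnId);

        void abortWrite(String txnId);

        // Sink side: only committed records are visible. The sink
        // acknowledges an offset after writing durably, so a sink failure
        // simply replays the cache from the last acknowledged offset.
        List<T> poll(int maxRecords);

        void ack(long offset);
    }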

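And for the Speed Control point, a minimal throttling sketch using Guava's
RateLimiter; the wrapper class and its names are illustrative only:

    import com.google.common.util.concurrent.RateLimiter;

    import java.util.List;
    import java.util.function.Consumer;

    // Illustrative wrapper: the engine throttles the source's emit loop to
    // a user-configured records-per-second budget for batch jobs.
    public class ThrottledEmitter<T> {

        private final RateLimiter limiter;

        public ThrottledEmitter(double recordsPerSecond) {
            this.limiter = RateLimiter.create(recordsPerSecond);
        }

        // Blocks just long enough to stay under the configured rate,
        // bounding the pressure put on the source and target databases.
        public void emit(List<T> batch, Consumer<T> downstream) {
            for (T record : batch) {
                limiter.acquire(); // waits if we are ahead of the budget
                downstream.accept(record);
            }
        }
    }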