Why do we need the SeaTunnel Engine, and what problems do we want to solve?


   - *Better resource utilization rate*

Real-time data synchronization is an important user scenario, and sometimes
users need to synchronize a full database in real time. Today, a common
practice among data synchronization engines is to run one job per table.
The advantage of this practice is that the failure of one job does not
affect the others, but it wastes resources when most of the tables only
hold a small amount of data.

We hope the SeaTunnel Engine can solve this problem. We plan to support a
more flexible resource sharing strategy: jobs submitted by the same user
will be allowed to share resources, and users will even be able to specify
which jobs share resources with each other. If anyone has ideas, you are
welcome to discuss them on the mailing list or in a GitHub issue.
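
To make the idea concrete, here is a minimal, purely illustrative sketch of
how jobs might declare a shared resource group. None of these class names
exist in SeaTunnel today; they are assumptions for discussion only.

    // Illustrative only: jobs that declare the same resource sharing group
    // could be scheduled into one shared pool of task slots instead of each
    // job reserving its own resources.
    public class ResourceSharingSketch {
        record JobDefinition(String jobName, String resourceSharingGroup) {}

        public static void main(String[] args) {
            JobDefinition orders = new JobDefinition("sync_orders", "user_a_shared");
            JobDefinition users  = new JobDefinition("sync_users", "user_a_shared");
            // Both small-table jobs run in the same slot pool.
            System.out.println(orders + " and " + users + " share one slot pool");
        }
    }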


   - *Fewer database connectors*

Another common problem in full database synchronization with CDC is that
each table needs its own database connector. This puts a lot of pressure on
the database server when there are many tables in the database.

Can we design the database connector as a resource shared between jobs?
Users could configure a database connector pool; when a job uses the pool,
the SeaTunnel Engine would initialize it on the node where the source/sink
connector runs and then push the pool into the source/sink connector.
Combined with the Better resource utilization rate feature
<https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
we can reduce the number of database connections to an acceptable range.
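
As a rough sketch of the shared-pool idea (HikariCP is used here only as a
familiar pooling library, and the registry class is an assumption, not the
actual engine design):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.sql.DataSource;

    // Illustrative only: a node-level registry that gives every source/sink
    // task on the same node one shared pool per JDBC URL, so the number of
    // connections is capped per node instead of growing per table and per job.
    public class SharedConnectorPoolRegistry {
        private static final Map<String, HikariDataSource> POOLS = new ConcurrentHashMap<>();

        public static DataSource getOrCreate(String jdbcUrl, String user, String password) {
            return POOLS.computeIfAbsent(jdbcUrl, url -> {
                HikariConfig config = new HikariConfig();
                config.setJdbcUrl(url);
                config.setUsername(user);
                config.setPassword(password);
                config.setMaximumPoolSize(10); // hypothetical cap per node and database
                return new HikariDataSource(config);
            });
        }
    }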

Another way to reduce the database connectors used by the CDC Source
Connector is to support reading multiple tables in a single CDC Source
Connector; the stream is then split by table name inside the SeaTunnel
Engine.
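
For example (a minimal sketch; ChangeRecord is a stand-in type, not an
existing SeaTunnel class):

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SplitByTableSketch {
        // A change event read from one multi-table CDC stream.
        record ChangeRecord(String tableName, String payload) {}

        // Group a batch read over a single database connection into
        // per-table streams, so each downstream sink only sees its own table.
        static Map<String, List<ChangeRecord>> splitByTable(List<ChangeRecord> batch) {
            return batch.stream().collect(Collectors.groupingBy(ChangeRecord::tableName));
        }
    }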

This reduces the database connectors used by the CDC Source Connector, but
it cannot reduce the connectors used by the sink if the synchronization
target is also a database. So a shared database connector pool is still a
good way to solve that part of the problem.


   - *Data Cache between Source and Sink*



Flume is an excellent data synchronization project. The Flume Channel can
cache data when the sink fails and cannot write data. This is useful in
some scenarios: for example, some users have a limited window in which to
capture their database logs, so the CDC Source Connector must keep reading
the logs even if the sink cannot write data.

A feasible solution today is to start two jobs: one job uses the CDC Source
Connector to read the database logs and the Kafka Sink Connector to write
the data to Kafka, and another job uses the Kafka Source Connector to read
the data from Kafka and the target Sink Connector to write it to the
target. This solution requires the user to have a deep understanding of the
underlying technology, and two jobs increase the difficulty of operation
and maintenance. Because every job needs its own JobMaster, it also needs
more resources.

Ideally, users should only need to declare that they read data from the
source and write data to the sink, while the data is cached along the way
in case the sink fails. The synchronization engine should automatically add
the cache operation to the execution plan and ensure the source keeps
working even if the sink fails. The engine also needs to make writing to
and reading from the cache transactional, which guarantees the consistency
of the data.

The execution plan would look roughly like this:
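
(A rough, illustrative sketch only; the stage names are not final.)

    CDC Source   -> Cache Sink     (write change records to the cache, transactionally)
    Cache Source -> Target Sink    (read from the cache and write to the target)

The engine would generate both pipelines from the single job the user
defines, so the source keeps reading logs even while the target sink is
unavailable.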


   - *Schema Evolution*

Schema evolution is a feature that allows users to easily change a table’s
current schema to accommodate data that is changing over time. Most
commonly, it’s used when performing an append or overwrite operation, to
automatically adapt the schema to include one or more new columns.

This feature is required in real-time data warehouse scenarios. Currently,
the Flink and Spark engines do not support it.
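
As a tiny illustration of what the engine would have to propagate (the
event type below is an assumed name, not an existing SeaTunnel class):

    // Illustrative only: the source observes something like
    // "ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2)" and the engine
    // forwards a schema change event so the sink can alter the target table
    // before writing rows that carry the new column.
    public class SchemaEvolutionSketch {
        record AddColumn(String table, String column, String sqlType) {}

        public static void main(String[] args) {
            AddColumn event = new AddColumn("orders", "discount", "DECIMAL(10,2)");
            System.out.println("Propagate to sink: " + event);
        }
    }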


   - *Finer fault tolerance*

At present, most real-time processing engines fail the whole job when one
of its tasks fails. The main reason is that downstream operators depend on
the calculation results of upstream operators. In the data synchronization
scenario, however, data is simply read from the source and written to the
sink, with no intermediate result state to keep. Therefore, the failure of
one task does not affect whether the results of the other tasks are
correct.

The new engine should provide finer-grained fault-tolerance management. It
should allow a single task to fail without affecting the execution of the
other tasks, and it should provide an interface so that users can manually
retry failed tasks instead of retrying the entire job.
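
One possible shape for that interface, purely as a sketch (the names are
assumptions, not an existing API):

    import java.util.List;

    // Illustrative only: a client-side view of finer-grained fault tolerance.
    public interface JobClient {
        // Tasks of the job that are currently FAILED while the rest keeps running.
        List<String> listFailedTasks(long jobId);

        // Retry a single failed task without restarting the whole job.
        void retryTask(long jobId, String taskId);
    }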


   - *Speed Control*

In batch jobs, we need to support speed control, letting users choose the
synchronization speed they want so that synchronization does not put too
much load on the source or target database.
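
A minimal sketch of what row-level speed control could look like (Guava's
RateLimiter is used only as a familiar example; the real engine would read
the limit from the job configuration):

    import com.google.common.util.concurrent.RateLimiter;

    public class SpeedControlSketch {
        public static void main(String[] args) {
            // Hypothetical user setting: at most 1000 rows per second.
            RateLimiter limiter = RateLimiter.create(1000.0);
            for (int i = 0; i < 5000; i++) {
                limiter.acquire();     // block until the next row may be written
                writeRowToSink(i);     // stand-in for the real sink write
            }
        }

        private static void writeRowToSink(int row) { /* no-op in this sketch */ }
    }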



*More Information*


I have drafted a simple design for the SeaTunnel Engine. You can find more
details in the following document:

https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub


-- 

Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
[email protected]
