Hi Leo65535,

Connectors based on the api-draft branch will be able to run directly in the new engine, just as they can run directly on Flink/Spark.
leo65535 <[email protected]> 于2022年5月30日周一 18:06写道: > > > Hi gaojun, > > > I think this is a good idea, I have a question that what is the difference > between api-draft and engine branch? > > > Best, > Leo65535 > At 2022-05-27 18:06:29, "JUN GAO" <[email protected]> wrote: > >Why do we need the SeaTunnel Engine, And what problems do we want to > solve? > > > > > > - *Better resource utilization rate* > > > >Real time data synchronization is an important user scenario. Sometimes we > >need real time synchronization of a full database. Now, Some common data > >synchronization engine practices are one job per table. The advantage of > >this practice is that one job failure does not influence another one. But > >this practice will cause more waste of resources when most of the tables > >only have a small amount of data. > > > >We hope the SeaTunnel Engine can solve this problem. We plan to support a > >more flexible resource share strategy. It will allow some jobs to share > the > >resources when they submit by the same user. Users can even specify which > >jobs share resources between them. If anyone has an idea, welcome to > >discuss in the mail list or github issue. > > > > > > - *Fewer database connectors* > > > >Another common problem in full database synchronization use CDC is each > >table needs a database connector. This will put a lot of pressure on the > db > >server when there are a lot of tables in the database. > > > >Can we design the database connectors as a shared resource between jobs? > >users can configure their database connectors pool. When a job uses the > >connector pool, SeaTunnel Engine will init the connector pool at the node > >which the source/sink connector at. And then push the connector pool in > the > >source/sink connector. With the feature of Better resource utilization > rate > >< > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8 > >, > >we can reduce the number of database connections to an acceptable range. > > > >Another way to reduce database connectors used by CDC Source Connector is > >to make multiple table read support in CDC Source Connector. And then the > >stream will be split by table name in the SeaTunnel Engine. > > > >This way reduces database connectors used by CDC Source Connector but it > >can not reduce the database connectors used by sink if the synchronization > >target is database too. So a shared database connector pool will be a good > >way to solve it. > > > > > > - *Data Cache between Source and Sink* > > > > > > > >Flume is an excellent data synchronization project. Flume Channel can > cache > >data > > > >when the sink fails and can not write data. This is useful in some > scenarios. > >For example, some users have limited time to save their database logs. CDC > >Source Connector must ensure it can read database logs even if sink can > not > >write data. > > > >A feasible solution is to start two jobs. One job uses CDC Source > >Connector to read database logs and then use Kafka Sink Connector to write > >data to kafka. And another job uses Kafka Source Connector to read data > >from kafka and then use the target Sink Connector to write data to the > >target. This solution needs the user to have a deep understanding of > >low-level technology, And two jobs will increase the difficulty of > >operation and maintenance. Because every job needs a JobMaster, So it will > >need more resources. 
> >
> >Ideally, users only need to know that they read data from the source and
> >write it to the sink, and that along the way the data can be cached in
> >case the sink fails. The synchronization engine should automatically add
> >the cache operation to the execution plan and ensure that the source can
> >keep working even if the sink fails. In this process, the engine needs
> >to make the writes to and reads from the cache transactional, which
> >guarantees data consistency.
> >
> >The execution plan looks like this: [execution plan diagram omitted]
> >
> > - *Schema Evolution*
> >
> >Schema evolution is a feature that allows users to easily change a
> >table's current schema to accommodate data that is changing over time.
> >Most commonly, it is used when performing an append or overwrite
> >operation, to automatically adapt the schema to include one or more new
> >columns.
> >
> >This feature is required in real-time data warehouse scenarios.
> >Currently, neither the Flink nor the Spark engine supports it.
> >
> > - *Finer fault tolerance*
> >
> >At present, most real-time processing engines fail the whole job when
> >one of its tasks fails, mainly because downstream operators depend on
> >the calculation results of upstream operators. In the data
> >synchronization scenario, however, the data is simply read from the
> >source and written to the sink, and no intermediate result state needs
> >to be saved. Therefore, the failure of one task does not affect whether
> >the results of the other tasks are correct.
> >
> >The new engine should provide more sophisticated fault-tolerance
> >management. It should support the failure of a single task without
> >affecting the execution of other tasks, and it should provide an
> >interface so that users can manually retry failed tasks instead of
> >retrying the entire job.
> >
> > - *Speed Control*
> >
> >In batch jobs, we need to support speed control, letting users choose
> >the synchronization speed they want so as to limit the impact on the
> >source or target database.
> >
> >*More Information*
> >
> >I have written a simple design for the SeaTunnel Engine. You can find
> >more details in the following document:
> >
> >https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> >
> >--
> >
> >Best Regards
> >
> >------------
> >
> >Apache DolphinScheduler PMC
> >
> >Jun Gao
> >[email protected]

--
Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
[email protected]
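
To make the shared connection pool idea from the proposal above concrete, here is a minimal Java sketch. It only illustrates the concept; none of these class or method names exist in SeaTunnel, and a real implementation would also need connection validation, eviction, and transaction handling.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: a fixed-size JDBC connection pool that several
    // source/sink tasks on the same node could share, instead of each task
    // opening its own connection to the database.
    public class SharedConnectionPoolSketch {
        private final BlockingQueue<Connection> pool;

        public SharedConnectionPoolSketch(String url, String user,
                String password, int size) throws SQLException {
            pool = new LinkedBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                pool.add(DriverManager.getConnection(url, user, password));
            }
        }

        // A task borrows a connection, blocking until one is free. With N
        // tasks sharing a pool of size M (M much smaller than N), the
        // database only ever sees M connections.
        public Connection borrow() throws InterruptedException {
            return pool.take();
        }

        // The task returns the connection instead of closing it.
        public void release(Connection connection) {
            pool.offer(connection);
        }
    }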
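The "data cache between source and sink" idea can likewise be illustrated with a bounded buffer that decouples the reader from the writer. Again, this is only a conceptual sketch, not SeaTunnel Engine code; as the proposal notes, the real cache would be a persistent, transactional operator added to the execution plan.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Conceptual sketch: the source keeps reading change logs into a
    // bounded cache even while the sink is stalled, so log records are not
    // lost when the database purges its logs.
    public class SourceSinkCacheSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> cache = new LinkedBlockingQueue<>(100_000);

            Thread source = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    try {
                        cache.put("log-record-" + i); // blocks only when full
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });

            Thread sink = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    try {
                        String record = cache.take();
                        // A transient sink failure here only lets the cache
                        // fill up; the source thread keeps reading regardless.
                        System.out.println("wrote " + record);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });

            source.start();
            sink.start();
            source.join();
            sink.join();
        }
    }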
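Finally, a sketch of the speed-control idea: a simple fixed-window rate limiter that caps how many records per second a batch job reads from the source. The names are made up for illustration; SeaTunnel would presumably expose this as a job configuration option rather than as user code.

    // Illustrative sketch of speed control: cap the number of records a
    // batch job reads per second so the source database is not overloaded.
    public class SpeedControlSketch {
        private final int recordsPerSecond;
        private long windowStartMs = System.currentTimeMillis();
        private int usedInWindow = 0;

        public SpeedControlSketch(int recordsPerSecond) {
            this.recordsPerSecond = recordsPerSecond;
        }

        // Blocks until the current one-second window has budget for one
        // more record.
        public synchronized void acquire() throws InterruptedException {
            long now = System.currentTimeMillis();
            if (now - windowStartMs >= 1000) { // start a new window
                windowStartMs = now;
                usedInWindow = 0;
            }
            if (usedInWindow >= recordsPerSecond) {
                Thread.sleep(windowStartMs + 1000 - now); // wait out window
                windowStartMs = System.currentTimeMillis();
                usedInWindow = 0;
            }
            usedInWindow++;
        }

        public static void main(String[] args) throws InterruptedException {
            SpeedControlSketch limiter = new SpeedControlSketch(500);
            for (int i = 0; i < 2_000; i++) {
                limiter.acquire();
                // readNextRecordFromSource(); // hypothetical read call
            }
            System.out.println("done");
        }
    }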
