Hi, Kelu

1. The SeaTunnel Engine is not designed to replace Flink/Spark. It will be the third engine supported by SeaTunnel, in addition to Flink and Spark. As I described in my email, it is designed to better solve the problems in the data synchronization process that are not well solved by Spark and Flink.
2. I think if we just develop an engine for data synchronization, it won't be too complicated. DataX is a good data synchronization engine, and in fact it doesn't have much code. Admittedly, the new engine will be more complex than DataX, because we need to support cluster mode. However, as a data synchronization engine, we do not need to consider shuffle, aggregation, join, and other operations, so it should be simpler than Spark or Flink.

I have completed a part of the validation code: https://github.com/apache/incubator-seatunnel/pull/1948 . I hope members of the community who are interested in the new engine will work with me to improve it.

Best Regards

---------------
Apache DolphinScheduler PMC
Jun Gao 高俊
[email protected]
---------------

> On May 30, 2022, at 8:59 PM, 陶克路 <[email protected]> wrote:
>
> Hi, gaojun, thanks for sharing.
>
> I have some questions about the engine:
>
> 1. What is the relationship with Flink/Spark? Is our engine designed to
> replace Flink/Spark?
> 2. If it is designed to replace Flink/Spark, how do we build such a huge
> thing from scratch?
> 3. If it is designed on top of Flink/Spark, how do we achieve our goal
> without modifying Flink/Spark code?
>
>
> Thanks,
> Kelu
>
> On Fri, May 27, 2022 at 6:07 PM JUN GAO <[email protected]> wrote:
>
>> Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>>
>>
>> - *Better resource utilization rate*
>>
>> Real-time data synchronization is an important user scenario. Sometimes we
>> need real-time synchronization of a full database. Currently, a common
>> practice among data synchronization engines is one job per table. The
>> advantage of this practice is that the failure of one job does not affect
>> another. But this practice wastes resources when most of the tables only
>> have a small amount of data.
>>
>> We hope the SeaTunnel Engine can solve this problem. We plan to support a
>> more flexible resource sharing strategy.
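To make the resource-sharing proposal above concrete, here is a hypothetical sketch of what such a configuration could look like, in the HOCON style SeaTunnel config files already use. The `resource_group` and `share_resources` keys are invented for illustration only; they are not an actual SeaTunnel Engine API:

```hocon
env {
  job.mode = "STREAMING"

  # Hypothetical keys: jobs submitted by the same user that declare the
  # same resource group could be scheduled into shared task slots,
  # instead of each job reserving its own dedicated resources.
  resource_group  = "user_a_sync_jobs"
  share_resources = true
}
```

Under such a scheme, ten small single-table sync jobs in one group could run in a handful of shared slots rather than ten separate allocations.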
>> It will allow some jobs to share resources when they are submitted by the
>> same user. Users can even specify which jobs share resources between them.
>> If anyone has an idea, welcome to discuss it on the mailing list or in a
>> GitHub issue.
>>
>>
>> - *Fewer database connectors*
>>
>> Another common problem in full database synchronization using CDC is that
>> each table needs a database connector. This puts a lot of pressure on the
>> database server when there are many tables in the database.
>>
>> Can we design the database connectors as a shared resource between jobs?
>> Users could configure their database connector pool. When a job uses the
>> connector pool, SeaTunnel Engine will initialize the connector pool on the
>> node where the source/sink connector runs, and then push the connector
>> pool into the source/sink connector. Combined with the better resource
>> utilization rate feature
>> <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
>> we can reduce the number of database connections to an acceptable range.
>>
>> Another way to reduce the database connectors used by the CDC Source
>> Connector is to support reading multiple tables in the CDC Source
>> Connector, and then split the stream by table name in the SeaTunnel
>> Engine.
>>
>> This reduces the database connectors used by the CDC Source Connector,
>> but it cannot reduce the database connectors used by the sink if the
>> synchronization target is a database too. So a shared database connector
>> pool will be a good way to solve it.
>>
>>
>> - *Data Cache between Source and Sink*
>>
>> Flume is an excellent data synchronization project. A Flume Channel can
>> cache data when the sink fails and cannot write data. This is useful in
>> some scenarios.
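The Flume-style channel idea quoted above can be sketched in a few lines of Java. This is a minimal illustration under my own assumptions, not SeaTunnel Engine or Flume code: the class and method names are invented, and it shows only the transactional take/commit/rollback behavior that lets the source keep writing while the sink is down:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical in-memory cache channel between a source and a sink.
 * The source side can always put; the sink side takes a batch inside
 * an implicit transaction and must commit or roll back.
 */
class CacheChannel<T> {
    private final Deque<T> buffer = new ArrayDeque<>();
    private final List<T> inFlight = new ArrayList<>();

    /** Source side: always succeeds, even while the sink is failing. */
    synchronized void put(T event) {
        buffer.addLast(event);
    }

    /** Sink side: begin a transaction by taking up to {@code max} events. */
    synchronized List<T> takeBatch(int max) {
        inFlight.clear();
        while (inFlight.size() < max && !buffer.isEmpty()) {
            inFlight.add(buffer.pollFirst());
        }
        return new ArrayList<>(inFlight);
    }

    /** Sink wrote the batch successfully: discard the in-flight events. */
    synchronized void commit() {
        inFlight.clear();
    }

    /** Sink failed: requeue the in-flight events in their original order. */
    synchronized void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            buffer.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    /** Number of events currently cached and not in flight. */
    synchronized int size() {
        return buffer.size();
    }
}
```

Because a failed take is rolled back in order, no event is lost or reordered when the sink recovers, which is the consistency property the proposal asks the engine to guarantee (a real implementation would also need to persist the buffer).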
>> For example, some users have limited time to retain their database logs.
>> The CDC Source Connector must ensure it can read database logs even if the
>> sink cannot write data.
>>
>> A feasible solution is to start two jobs: one job uses the CDC Source
>> Connector to read database logs and the Kafka Sink Connector to write the
>> data to Kafka, and another job uses the Kafka Source Connector to read the
>> data from Kafka and the target Sink Connector to write it to the target.
>> This solution requires the user to have a deep understanding of low-level
>> technology, and two jobs increase the difficulty of operation and
>> maintenance. Because every job needs a JobMaster, it will also need more
>> resources.
>>
>> Ideally, users only know that they read data from the source and write
>> data to the sink, while in this process the data can be cached in case the
>> sink fails. The synchronization engine needs to automatically add the
>> cache operation to the execution plan and ensure the source can work even
>> if the sink fails. In this process, the engine needs to ensure that the
>> data written to the cache and read from the cache is transactional; this
>> ensures the consistency of the data.
>>
>> The execution plan looks like this:
>>
>>
>> - *Schema Evolution*
>>
>> Schema evolution is a feature that allows users to easily change a table's
>> current schema to accommodate data that changes over time. Most commonly,
>> it's used when performing an append or overwrite operation, to
>> automatically adapt the schema to include one or more new columns.
>>
>> This feature is required in real-time data warehouse scenarios. Currently,
>> the Flink and Spark engines do not support this feature.
>>
>>
>> - *Finer fault tolerance*
>>
>> At present, most real-time processing engines will fail the whole job when
>> one of its tasks fails. The main reason is that the downstream operator
>> depends on the calculation results of the upstream operator.
>> However, in the scenario of data synchronization, the data is simply read
>> from the source and then written to the sink. It does not need to save
>> intermediate result state. Therefore, the failure of one task does not
>> affect whether the results of the other tasks are correct.
>>
>> The new engine should provide more sophisticated fault-tolerance
>> management. It should support the failure of a single task without
>> affecting the execution of other tasks, and it should provide an interface
>> so that users can manually retry failed tasks instead of retrying the
>> entire job.
>>
>>
>> - *Speed Control*
>>
>> In batch jobs, we need to support speed control, letting users choose the
>> synchronization speed they want, to prevent too much impact on the source
>> or target database.
>>
>>
>> *More Information*
>>
>> I made a simple design for the SeaTunnel Engine. You can learn more
>> details in the following document:
>>
>> https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>>
>>
>> --
>>
>> Best Regards
>>
>> ------------
>>
>> Apache DolphinScheduler PMC
>>
>> Jun Gao
>> [email protected]
>
>
> --
>
> Hello, Find me here: www.legendtkl.com
