Hi gaojun,
I think this is a good idea. I have a question: what is the difference between the api-draft and engine branches?

Best,
Leo65535

At 2022-05-27 18:06:29, "JUN GAO" <[email protected]> wrote:

>Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>
> - *Better resource utilization rate*
>
>Real-time data synchronization is an important user scenario. Sometimes we
>need real-time synchronization of a full database. A common practice among
>data synchronization engines today is one job per table. The advantage of
>this practice is that the failure of one job does not affect the others.
>But it wastes resources when most of the tables hold only a small amount
>of data.
>
>We hope the SeaTunnel Engine can solve this problem. We plan to support a
>more flexible resource-sharing strategy that allows jobs submitted by the
>same user to share resources. Users can even specify which jobs share
>resources with each other. If anyone has an idea, you are welcome to
>discuss it on the mailing list or in a GitHub issue.
>
> - *Fewer database connectors*
>
>Another common problem in full-database synchronization with CDC is that
>each table needs its own database connection. This puts a lot of pressure
>on the database server when the database contains many tables.
>
>Can we design database connections as a resource shared between jobs?
>Users could configure a database connection pool. When a job uses the
>pool, the SeaTunnel Engine will initialize it on the node where the
>source/sink connector runs and then pass the pool into the source/sink
>connector. Combined with the better resource utilization rate described
>above
><https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
>we can reduce the number of database connections to an acceptable range.
>
>Another way to reduce the database connections used by the CDC Source
>Connector is to support reading multiple tables in a single CDC Source
>Connector; the stream is then split by table name inside the SeaTunnel
>Engine.
>
>This reduces the connections used by the CDC Source Connector, but it
>cannot reduce the connections used by the sink when the synchronization
>target is also a database. A shared database connection pool is a good way
>to solve both sides.
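
A minimal sketch of the shared-pool idea above. The registry below is
hypothetical (not existing SeaTunnel API), and HikariCP is used purely for
illustration:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical per-node registry that lets the source/sink connectors
    // of several jobs share one JDBC pool instead of opening their own
    // connections per table.
    public final class ConnectorPoolRegistry {

        private static final Map<String, HikariDataSource> POOLS =
                new ConcurrentHashMap<>();

        private ConnectorPoolRegistry() {}

        // Jobs that configure the same pool name share the same underlying
        // pool, so N tables no longer imply N independent connection sets.
        public static HikariDataSource getOrCreate(String poolName,
                                                   String jdbcUrl,
                                                   String user,
                                                   String password,
                                                   int maxSize) {
            return POOLS.computeIfAbsent(poolName, name -> {
                HikariConfig config = new HikariConfig();
                config.setPoolName(name);
                config.setJdbcUrl(jdbcUrl);
                config.setUsername(user);
                config.setPassword(password);
                config.setMaximumPoolSize(maxSize); // the bound the DB server sees
                return new HikariDataSource(config);
            });
        }

        public static Connection borrow(String poolName) throws SQLException {
            HikariDataSource pool = POOLS.get(poolName);
            if (pool == null) {
                throw new IllegalStateException("Pool not initialized: " + poolName);
            }
            return pool.getConnection(); // handed back to the pool on close()
        }
    }

With something like this, the total connection count is bounded by maxSize
per pool rather than by the number of tables.
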
>
> - *Data Cache between Source and Sink*
>
>Flume is an excellent data synchronization project. The Flume Channel can
>cache data when the sink fails and cannot write it. This is useful in some
>scenarios. For example, some users have a limited window in which to save
>their database logs, so the CDC Source Connector must be able to keep
>reading those logs even if the sink cannot write data.
>
>A feasible workaround today is to start two jobs. One job uses the CDC
>Source Connector to read database logs and the Kafka Sink Connector to
>write them to Kafka; another job uses the Kafka Source Connector to read
>from Kafka and the target Sink Connector to write to the target. This
>solution requires the user to have a deep understanding of low-level
>technology, and two jobs increase the difficulty of operation and
>maintenance. Because every job needs a JobMaster, it also consumes more
>resources.
>
>Ideally, users only need to know that data is read from the source and
>written to the sink, and that along the way the data can be cached in case
>the sink fails. The synchronization engine needs to add the cache
>operation to the execution plan automatically and ensure the source keeps
>working even if the sink fails. In this process, the engine must make the
>writes to and reads from the cache transactional, which guarantees data
>consistency. (See the interface sketch below the quoted mail.)
>
>The execution plan looks like this: [diagram not included in the
>plain-text mail]
>
> - *Schema Evolution*
>
>Schema evolution is a feature that allows users to easily change a table's
>current schema to accommodate data that changes over time. Most commonly,
>it is used when performing an append or overwrite operation, to
>automatically adapt the schema to include one or more new columns.
>
>This feature is required in real-time data warehouse scenarios. Currently,
>the Flink and Spark engines do not support it.
>
> - *Finer fault tolerance*
>
>At present, most real-time processing engines fail the whole job when one
>of its tasks fails. The main reason is that downstream operators depend on
>the results of upstream operators. In data synchronization, however, data
>is simply read from the source and written to the sink; no intermediate
>result state needs to be saved, so the failure of one task does not affect
>the correctness of the results of other tasks.
>
>The new engine should provide more sophisticated fault-tolerance
>management. It should support the failure of a single task without
>affecting the execution of other tasks, and it should provide an interface
>so that users can manually retry failed tasks instead of retrying the
>entire job.
>
> - *Speed Control*
>
>In batch jobs, we need to support speed control, letting users choose the
>synchronization speed they want so as to limit the impact on the source
>and target databases. (A throttling sketch appears at the end of this
>mail.)
>
>*More Information*
>
>I made a simple design for the SeaTunnel Engine. You can learn more
>details in the following document.
>
>https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>
>--
>
>Best Regards
>
>------------
>
>Apache DolphinScheduler PMC
>
>Jun Gao
>[email protected]
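
On the Data Cache point above, here is a rough interface sketch of what an
engine-inserted transactional cache stage between source and sink could
look like. Every name here is hypothetical, not existing SeaTunnel API:

    import java.util.List;

    // Hypothetical cache stage the engine would splice into the plan:
    //   Source -> CacheChannel -> Sink
    // Writes and reads are transactional so that data is neither lost nor
    // duplicated while the sink is down.
    public interface CacheChannel<T> {

        // Source side: buffer records inside a cache transaction.
        String beginWriteTransaction();

        void write(String txnId, T record);

        // Commit atomically on checkpoint; abort rolls the buffer back, so
        // a source failure never leaves half-written batches visible.
        void commitWrite(String txnId);

        void abortWrite(String txnId);

        // Sink side: only committed records are visible. The sink
        // acknowledges an offset after writing durably, so a sink failure
        // simply replays the cache from the last acknowledged offset.
        List<T> poll(int maxRecords);

        void ack(long offset);
    }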

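And for the Speed Control point, a minimal throttling sketch using Guava's
RateLimiter; the wrapper class and its names are illustrative only:

    import com.google.common.util.concurrent.RateLimiter;

    import java.util.List;
    import java.util.function.Consumer;

    // Illustrative wrapper: the engine throttles the source's emit loop to
    // a user-configured records-per-second budget for batch jobs.
    public class ThrottledEmitter<T> {

        private final RateLimiter limiter;

        public ThrottledEmitter(double recordsPerSecond) {
            this.limiter = RateLimiter.create(recordsPerSecond);
        }

        // Blocks just long enough to stay under the configured rate,
        // bounding the pressure put on the source and target databases.
        public void emit(List<T> batch, Consumer<T> downstream) {
            for (T record : batch) {
                limiter.acquire(); // waits if we are ahead of the budget
                downstream.accept(record);
            }
        }
    }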