Hi Leo65535,

Connectors based on the api-draft branch will be able to run directly in the new engine, just as they can run directly on Flink/Spark.
leo65535 <[email protected]> 于2022年5月30日周一 18:06写道: > > > Hi gaojun, > > > I think this is a good idea, I have a question that what is the difference > between api-draft and engine branch? > > > Best, > Leo65535 > At 2022-05-27 18:06:29, "JUN GAO" <[email protected]> wrote: > >Why do we need the SeaTunnel Engine, And what problems do we want to > solve? > > > > > > - *Better resource utilization rate* > > > >Real time data synchronization is an important user scenario. Sometimes we > >need real time synchronization of a full database. Now, Some common data > >synchronization engine practices are one job per table. The advantage of > >this practice is that one job failure does not influence another one. But > >this practice will cause more waste of resources when most of the tables > >only have a small amount of data. > > > >We hope the SeaTunnel Engine can solve this problem. We plan to support a > >more flexible resource share strategy. It will allow some jobs to share > the > >resources when they submit by the same user. Users can even specify which > >jobs share resources between them. If anyone has an idea, welcome to > >discuss in the mail list or github issue. > > > > > > - *Fewer database connectors* > > > >Another common problem in full database synchronization use CDC is each > >table needs a database connector. This will put a lot of pressure on the > db > >server when there are a lot of tables in the database. > > > >Can we design the database connectors as a shared resource between jobs? > >users can configure their database connectors pool. When a job uses the > >connector pool, SeaTunnel Engine will init the connector pool at the node > >which the source/sink connector at. And then push the connector pool in > the > >source/sink connector. With the feature of Better resource utilization > rate > >< > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8 > >, > >we can reduce the number of database connections to an acceptable range. > > > >Another way to reduce database connectors used by CDC Source Connector is > >to make multiple table read support in CDC Source Connector. And then the > >stream will be split by table name in the SeaTunnel Engine. > > > >This way reduces database connectors used by CDC Source Connector but it > >can not reduce the database connectors used by sink if the synchronization > >target is database too. So a shared database connector pool will be a good > >way to solve it. > > > > > > - *Data Cache between Source and Sink* > > > > > > > >Flume is an excellent data synchronization project. Flume Channel can > cache > >data > > > >when the sink fails and can not write data. This is useful in some > scenarios. > >For example, some users have limited time to save their database logs. CDC > >Source Connector must ensure it can read database logs even if sink can > not > >write data. > > > >A feasible solution is to start two jobs. One job uses CDC Source > >Connector to read database logs and then use Kafka Sink Connector to write > >data to kafka. And another job uses Kafka Source Connector to read data > >from kafka and then use the target Sink Connector to write data to the > >target. This solution needs the user to have a deep understanding of > >low-level technology, And two jobs will increase the difficulty of > >operation and maintenance. Because every job needs a JobMaster, So it will > >need more resources. 
> >
> >Ideally, users only need to know that they read data from the source and
> >write it to the sink, and that along the way the data can be cached in
> >case the sink fails. The synchronization engine should automatically add
> >the cache operation to the execution plan and ensure that the source can
> >keep working even if the sink fails. In this process, the engine needs
> >to make the writes to and reads from the cache transactional, which
> >guarantees data consistency.
> >
> >The execution plan looks like this: [execution plan diagram omitted]
> >
> > - *Schema Evolution*
> >
> >Schema evolution is a feature that allows users to easily change a
> >table's current schema to accommodate data that is changing over time.
> >Most commonly, it is used when performing an append or overwrite
> >operation, to automatically adapt the schema to include one or more new
> >columns.
> >
> >This feature is required in real-time data warehouse scenarios.
> >Currently, neither the Flink nor the Spark engine supports it.
> >
> > - *Finer fault tolerance*
> >
> >At present, most real-time processing engines fail the whole job when
> >one of its tasks fails, mainly because downstream operators depend on
> >the calculation results of upstream operators. In the data
> >synchronization scenario, however, the data is simply read from the
> >source and written to the sink, and no intermediate result state needs
> >to be saved. Therefore, the failure of one task does not affect whether
> >the results of the other tasks are correct.
> >
> >The new engine should provide more sophisticated fault-tolerance
> >management. It should support the failure of a single task without
> >affecting the execution of other tasks, and it should provide an
> >interface so that users can manually retry failed tasks instead of
> >retrying the entire job.
> >
> > - *Speed Control*
> >
> >In batch jobs, we need to support speed control, letting users choose
> >the synchronization speed they want so as to limit the impact on the
> >source or target database.
> >
> >*More Information*
> >
> >I have written a simple design for the SeaTunnel Engine. You can find
> >more details in the following document:
> >
> >https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> >
> >--
> >
> >Best Regards
> >
> >------------
> >
> >Apache DolphinScheduler PMC
> >
> >Jun Gao
> >[email protected]

--
Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
[email protected]
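
To make the shared connection pool idea from the proposal above concrete, here is a minimal Java sketch. It only illustrates the concept; none of these class or method names exist in SeaTunnel, and a real implementation would also need connection validation, eviction, and transaction handling.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: a fixed-size JDBC connection pool that several
    // source/sink tasks on the same node could share, instead of each task
    // opening its own connection to the database.
    public class SharedConnectionPoolSketch {
        private final BlockingQueue<Connection> pool;

        public SharedConnectionPoolSketch(String url, String user,
                String password, int size) throws SQLException {
            pool = new LinkedBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                pool.add(DriverManager.getConnection(url, user, password));
            }
        }

        // A task borrows a connection, blocking until one is free. With N
        // tasks sharing a pool of size M (M much smaller than N), the
        // database only ever sees M connections.
        public Connection borrow() throws InterruptedException {
            return pool.take();
        }

        // The task returns the connection instead of closing it.
        public void release(Connection connection) {
            pool.offer(connection);
        }
    }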
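The "data cache between source and sink" idea can likewise be illustrated with a bounded buffer that decouples the reader from the writer. Again, this is only a conceptual sketch, not SeaTunnel Engine code; as the proposal notes, the real cache would be a persistent, transactional operator added to the execution plan.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Conceptual sketch: the source keeps reading change logs into a
    // bounded cache even while the sink is stalled, so log records are not
    // lost when the database purges its logs.
    public class SourceSinkCacheSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> cache = new LinkedBlockingQueue<>(100_000);

            Thread source = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    try {
                        cache.put("log-record-" + i); // blocks only when full
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });

            Thread sink = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    try {
                        String record = cache.take();
                        // A transient sink failure here only lets the cache
                        // fill up; the source thread keeps reading regardless.
                        System.out.println("wrote " + record);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            });

            source.start();
            sink.start();
            source.join();
            sink.join();
        }
    }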
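Finally, a sketch of the speed-control idea: a simple fixed-window rate limiter that caps how many records per second a batch job reads from the source. The names are made up for illustration; SeaTunnel would presumably expose this as a job configuration option rather than as user code.

    // Illustrative sketch of speed control: cap the number of records a
    // batch job reads per second so the source database is not overloaded.
    public class SpeedControlSketch {
        private final int recordsPerSecond;
        private long windowStartMs = System.currentTimeMillis();
        private int usedInWindow = 0;

        public SpeedControlSketch(int recordsPerSecond) {
            this.recordsPerSecond = recordsPerSecond;
        }

        // Blocks until the current one-second window has budget for one
        // more record.
        public synchronized void acquire() throws InterruptedException {
            long now = System.currentTimeMillis();
            if (now - windowStartMs >= 1000) { // start a new window
                windowStartMs = now;
                usedInWindow = 0;
            }
            if (usedInWindow >= recordsPerSecond) {
                Thread.sleep(windowStartMs + 1000 - now); // wait out window
                windowStartMs = System.currentTimeMillis();
                usedInWindow = 0;
            }
            usedInWindow++;
        }

        public static void main(String[] args) throws InterruptedException {
            SpeedControlSketch limiter = new SpeedControlSketch(500);
            for (int i = 0; i < 2_000; i++) {
                limiter.acquire();
                // readNextRecordFromSource(); // hypothetical read call
            }
            System.out.println("done");
        }
    }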
