Hi, Kelu

1. The SeaTunnel Engine is not designed to replace Flink/Spark. It will be the third engine supported by SeaTunnel, in addition to Flink and Spark. As I described in my email, it is designed to better solve the problems in the data synchronization process that are not well solved by Spark and Flink.
2. I think if we just develop an engine for data synchronization, it won't be too complicated. DataX is a good data synchronization engine, and in fact it doesn't have much code. Admittedly, the new engine will be more complex than DataX, because we need to support cluster mode. However, as a data synchronization engine, we do not need to consider shuffle, aggregation, join, and other operations, so it should be simpler than Spark or Flink.

I have completed a part of the validation code: https://github.com/apache/incubator-seatunnel/pull/1948 . I hope members of the community who are interested in the new engine will work with me to improve it.

Best Regards

---------------
Apache DolphinScheduler PMC
Jun Gao 高俊
[email protected]
---------------

> On May 30, 2022, at 8:59 PM, 陶克路 <[email protected]> wrote:
>
> Hi, gaojun, thanks for sharing.
>
> I have some questions about the engine:
>
> 1. What is the relationship with Flink/Spark? Is our engine designed to
> replace Flink/Spark?
> 2. If it is designed to replace Flink/Spark, how do we build such a huge
> thing from scratch?
> 3. If it is designed on top of Flink/Spark, how do we achieve our goal
> without modifying Flink/Spark code?
>
>
> Thanks,
> Kelu
>
> On Fri, May 27, 2022 at 6:07 PM JUN GAO <[email protected]> wrote:
>
>> Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>>
>>
>> - *Better resource utilization rate*
>>
>> Real-time data synchronization is an important user scenario. Sometimes we
>> need real-time synchronization of a full database. Currently, a common
>> practice among data synchronization engines is one job per table. The
>> advantage of this practice is that the failure of one job does not affect
>> another. But this practice wastes resources when most of the tables only
>> have a small amount of data.
>>
>> We hope the SeaTunnel Engine can solve this problem. We plan to support a
>> more flexible resource sharing strategy.
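To make the resource-sharing proposal above concrete, here is a hypothetical sketch of what such a configuration could look like, in the HOCON style SeaTunnel config files already use. The `resource_group` and `share_resources` keys are invented for illustration only; they are not an actual SeaTunnel Engine API:

```hocon
env {
  job.mode = "STREAMING"

  # Hypothetical keys: jobs submitted by the same user that declare the
  # same resource group could be scheduled into shared task slots,
  # instead of each job reserving its own dedicated resources.
  resource_group  = "user_a_sync_jobs"
  share_resources = true
}
```

Under such a scheme, ten small single-table sync jobs in one group could run in a handful of shared slots rather than ten separate allocations.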
>> It will allow some jobs to share resources when they are submitted by the
>> same user. Users can even specify which jobs share resources between them.
>> If anyone has an idea, welcome to discuss it on the mailing list or in a
>> GitHub issue.
>>
>>
>> - *Fewer database connectors*
>>
>> Another common problem in full database synchronization using CDC is that
>> each table needs a database connector. This puts a lot of pressure on the
>> database server when there are many tables in the database.
>>
>> Can we design the database connectors as a shared resource between jobs?
>> Users could configure their database connector pool. When a job uses the
>> connector pool, SeaTunnel Engine will initialize the connector pool on the
>> node where the source/sink connector runs, and then push the connector
>> pool into the source/sink connector. Combined with the better resource
>> utilization rate feature
>> <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
>> we can reduce the number of database connections to an acceptable range.
>>
>> Another way to reduce the database connectors used by the CDC Source
>> Connector is to support reading multiple tables in the CDC Source
>> Connector, and then split the stream by table name in the SeaTunnel
>> Engine.
>>
>> This reduces the database connectors used by the CDC Source Connector,
>> but it cannot reduce the database connectors used by the sink if the
>> synchronization target is a database too. So a shared database connector
>> pool will be a good way to solve it.
>>
>>
>> - *Data Cache between Source and Sink*
>>
>> Flume is an excellent data synchronization project. A Flume Channel can
>> cache data when the sink fails and cannot write data. This is useful in
>> some scenarios.
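The Flume-style channel idea quoted above can be sketched in a few lines of Java. This is a minimal illustration under my own assumptions, not SeaTunnel Engine or Flume code: the class and method names are invented, and it shows only the transactional take/commit/rollback behavior that lets the source keep writing while the sink is down:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical in-memory cache channel between a source and a sink.
 * The source side can always put; the sink side takes a batch inside
 * an implicit transaction and must commit or roll back.
 */
class CacheChannel<T> {
    private final Deque<T> buffer = new ArrayDeque<>();
    private final List<T> inFlight = new ArrayList<>();

    /** Source side: always succeeds, even while the sink is failing. */
    synchronized void put(T event) {
        buffer.addLast(event);
    }

    /** Sink side: begin a transaction by taking up to {@code max} events. */
    synchronized List<T> takeBatch(int max) {
        inFlight.clear();
        while (inFlight.size() < max && !buffer.isEmpty()) {
            inFlight.add(buffer.pollFirst());
        }
        return new ArrayList<>(inFlight);
    }

    /** Sink wrote the batch successfully: discard the in-flight events. */
    synchronized void commit() {
        inFlight.clear();
    }

    /** Sink failed: requeue the in-flight events in their original order. */
    synchronized void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            buffer.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    /** Number of events currently cached and not in flight. */
    synchronized int size() {
        return buffer.size();
    }
}
```

Because a failed take is rolled back in order, no event is lost or reordered when the sink recovers, which is the consistency property the proposal asks the engine to guarantee (a real implementation would also need to persist the buffer).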
>> For example, some users have limited time to retain their database logs.
>> The CDC Source Connector must ensure it can read database logs even if the
>> sink cannot write data.
>>
>> A feasible solution is to start two jobs: one job uses the CDC Source
>> Connector to read database logs and the Kafka Sink Connector to write the
>> data to Kafka, and another job uses the Kafka Source Connector to read the
>> data from Kafka and the target Sink Connector to write it to the target.
>> This solution requires the user to have a deep understanding of low-level
>> technology, and two jobs increase the difficulty of operation and
>> maintenance. Because every job needs a JobMaster, it will also need more
>> resources.
>>
>> Ideally, users only know that they read data from the source and write
>> data to the sink, while in this process the data can be cached in case the
>> sink fails. The synchronization engine needs to automatically add the
>> cache operation to the execution plan and ensure the source can work even
>> if the sink fails. In this process, the engine needs to ensure that the
>> data written to the cache and read from the cache is transactional; this
>> ensures the consistency of the data.
>>
>> The execution plan looks like this:
>>
>>
>> - *Schema Evolution*
>>
>> Schema evolution is a feature that allows users to easily change a table's
>> current schema to accommodate data that changes over time. Most commonly,
>> it's used when performing an append or overwrite operation, to
>> automatically adapt the schema to include one or more new columns.
>>
>> This feature is required in real-time data warehouse scenarios. Currently,
>> the Flink and Spark engines do not support this feature.
>>
>>
>> - *Finer fault tolerance*
>>
>> At present, most real-time processing engines will fail the whole job when
>> one of its tasks fails. The main reason is that the downstream operator
>> depends on the calculation results of the upstream operator.
>> However, in the scenario of data synchronization, the data is simply read
>> from the source and then written to the sink. It does not need to save
>> intermediate result state. Therefore, the failure of one task does not
>> affect whether the results of the other tasks are correct.
>>
>> The new engine should provide more sophisticated fault-tolerance
>> management. It should support the failure of a single task without
>> affecting the execution of other tasks, and it should provide an interface
>> so that users can manually retry failed tasks instead of retrying the
>> entire job.
>>
>>
>> - *Speed Control*
>>
>> In batch jobs, we need to support speed control, letting users choose the
>> synchronization speed they want, to prevent too much impact on the source
>> or target database.
>>
>>
>> *More Information*
>>
>> I made a simple design for the SeaTunnel Engine. You can learn more
>> details in the following document:
>>
>> https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>>
>>
>> --
>>
>> Best Regards
>>
>> ------------
>>
>> Apache DolphinScheduler PMC
>>
>> Jun Gao
>> [email protected]
>
>
> --
>
> Hello, Find me here: www.legendtkl.com
