This is an automated email from the ASF dual-hosted git repository.

kirs pushed a commit to branch main
in repository 
https://gitbox.apache.org/repos/asf/incubator-seatunnel-website.git


The following commit(s) were added to refs/heads/main by this push:
     new 188df8e1e4 Add Blog (#180)
188df8e1e4 is described below

commit 188df8e1e4faf0fbe946fc08bc4b142e3f8f4438
Author: lifeng <[email protected]>
AuthorDate: Mon Dec 19 15:32:29 2022 +0800

    Add Blog (#180)
    
    * Add Blog
    
    * update
---
 ...ter-analyzing-these-9-points-of-how-it-works.md | 304 +++++++++++++++++++++
 ...-IoTDB-to-implement-IoT-data-synchronization.md | 209 ++++++++++++++
 ...signed-for-tens-of-billions-data-integration.md | 168 ++++++++++++
 static/image/16714309762810/16714309876928.jpg     | Bin 0 -> 84511 bytes
 static/image/16714309762810/16714310892722.jpg     | Bin 0 -> 38299 bytes
 static/image/16714309762810/16714310916195.jpg     | Bin 0 -> 21969 bytes
 static/image/16714309762810/16714310939883.jpg     | Bin 0 -> 24425 bytes
 static/image/16714309762810/16714311670656.jpg     | Bin 0 -> 36180 bytes
 static/image/16714309762810/16714312426416.jpg     | Bin 0 -> 28519 bytes
 static/image/16714309762810/16714312637015.jpg     | Bin 0 -> 57136 bytes
 static/image/16714309762810/16714312769761.jpg     | Bin 0 -> 44875 bytes
 static/image/16714309762810/16714313100541.jpg     | Bin 0 -> 82883 bytes
 static/image/16714309762810/16714313301435.jpg     | Bin 0 -> 48643 bytes
 static/image/16714309762810/16714313559400.jpg     | Bin 0 -> 33701 bytes
 static/image/16714309762810/16714313919617.jpg     | Bin 0 -> 31755 bytes
 static/image/16714309762810/16714314318184.jpg     | Bin 0 -> 29638 bytes
 static/image/16714309762810/16714314489916.jpg     | Bin 0 -> 74265 bytes
 static/image/16714309762810/16714314658701.jpg     | Bin 0 -> 43890 bytes
 static/image/16714309762810/16714314806989.jpg     | Bin 0 -> 62131 bytes
 static/image/16714309762810/16714315001348.jpg     | Bin 0 -> 43282 bytes
 static/image/16714316310459/16714316482580.jpg     | Bin 0 -> 68668 bytes
 static/image/16714316310459/16714316839299.jpg     | Bin 0 -> 76913 bytes
 static/image/16714316310459/16714316928812.jpg     | Bin 0 -> 95133 bytes
 static/image/16714316310459/16714317022389.jpg     | Bin 0 -> 64049 bytes
 static/image/16714316310459/16714317067444.jpg     | Bin 0 -> 33412 bytes
 static/image/16714316310459/16714317166018.jpg     | Bin 0 -> 204281 bytes
 static/image/16714316310459/16714317203806.jpg     | Bin 0 -> 209742 bytes
 static/image/16714316310459/16714317262218.jpg     | Bin 0 -> 140892 bytes
 static/image/16714316310459/16714317512435.jpg     | Bin 0 -> 96059 bytes
 static/image/16714316310459/16714317679569.jpg     | Bin 0 -> 87184 bytes
 static/image/16714316310459/16714317930593.jpg     | Bin 0 -> 65519 bytes
 static/image/16714316310459/16714318216373.jpg     | Bin 0 -> 107898 bytes
 static/image/16714316310459/16714318381313.jpg     | Bin 0 -> 47056 bytes
 static/image/16714316310459/16714318550071.jpg     | Bin 0 -> 56231 bytes
 static/image/16714316310459/16714318796196.jpg     | Bin 0 -> 65959 bytes
 static/image/16714316310459/16714318873362.jpg     | Bin 0 -> 23316 bytes
 static/image/16714316310459/16714319281280.jpg     | Bin 0 -> 29749 bytes
 static/image/16714316310459/16714319372730.jpg     | Bin 0 -> 58595 bytes
 static/image/16714316310459/16714319569097.jpg     | Bin 0 -> 41398 bytes
 static/image/16714316310459/16714319599521.jpg     | Bin 0 -> 43831 bytes
 static/image/16714316310459/16714319862080.jpg     | Bin 0 -> 76262 bytes
 static/image/16714316310459/16714319949478.jpg     | Bin 0 -> 82030 bytes
 static/image/16714316310459/16714320117277.jpg     | Bin 0 -> 40771 bytes
 static/image/16714316310459/16714320201848.jpg     | Bin 0 -> 51662 bytes
 static/image/16714316310459/16714320394193.jpg     | Bin 0 -> 47429 bytes
 static/image/16714316310459/16714320856052.jpg     | Bin 0 -> 43899 bytes
 static/image/16714316310459/16714321290339.jpg     | Bin 0 -> 48896 bytes
 static/image/16714316310459/16714321480992.jpg     | Bin 0 -> 47814 bytes
 static/image/16714322747890/16714322908857.jpg     | Bin 0 -> 94498 bytes
 static/image/16714322747890/16714322944041.jpg     | Bin 0 -> 32656 bytes
 static/image/16714322747890/16714323322988.jpg     | Bin 0 -> 106431 bytes
 static/image/16714322747890/16714323668557.jpg     | Bin 0 -> 34628 bytes
 static/image/16714322747890/16714323741126.jpg     | Bin 0 -> 52776 bytes
 static/image/16714322747890/16714323846302.jpg     | Bin 0 -> 41958 bytes
 static/image/16714322747890/16714323920717.jpg     | Bin 0 -> 42363 bytes
 static/image/16714322747890/16714324136843.jpg     | Bin 0 -> 70919 bytes
 static/image/16714322747890/16714324237449.jpg     | Bin 0 -> 34695 bytes
 static/image/16714322747890/16714324336701.jpg     | Bin 0 -> 54465 bytes
 static/image/16714322747890/16714324454477.jpg     | Bin 0 -> 71137 bytes
 static/image/16714322747890/16714324726276.jpg     | Bin 0 -> 72596 bytes
 static/image/16714322747890/16714325444078.jpg     | Bin 0 -> 66426 bytes
 static/image/16714322747890/16714325474972.jpg     | Bin 0 -> 40127 bytes
 static/image/16714322747890/16714325600316.jpg     | Bin 0 -> 70028 bytes
 static/image/16714322747890/16714325681738.jpg     | Bin 0 -> 69049 bytes
 static/image/16714322747890/16714325887598.jpg     | Bin 0 -> 78541 bytes
 static/image/16714322747890/16714326227360.jpg     | Bin 0 -> 72323 bytes
 static/image/16714322747890/16714326749427.jpg     | Bin 0 -> 90849 bytes
 static/image/16714322747890/16714327002726.jpg     | Bin 0 -> 57446 bytes
 static/image/16714322747890/16714327218590.jpg     | Bin 0 -> 47709 bytes
 69 files changed, 681 insertions(+)

diff --git 
a/blog/2022-11-17-Mafengwo-finally-chose-Apache-SeaTunnel-after-analyzing-these-9-points-of-how-it-works.md
 
b/blog/2022-11-17-Mafengwo-finally-chose-Apache-SeaTunnel-after-analyzing-these-9-points-of-how-it-works.md
new file mode 100644
index 0000000000..27d01032b0
--- /dev/null
+++ 
b/blog/2022-11-17-Mafengwo-finally-chose-Apache-SeaTunnel-after-analyzing-these-9-points-of-how-it-works.md
@@ -0,0 +1,304 @@
+---
+slug: Mafengwo-finally-chose-Apache-SeaTunnel-after-analyzing-these-9-points-of-how-it-works
+title: Mafengwo finally chose Apache SeaTunnel after analyzing these 9 points of how it works!
+tags: [Meetup]
+---
+# Mafengwo finally chose Apache SeaTunnel after analyzing these 9 points of 
how it works!
+![](/image/16714322747890/16714322908857.jpg)
+
+![](/image/16714322747890/16714322944041.jpg)
+Bo Bi, data engineer at Mafengwo
+
+> During the joint Apache SeaTunnel & IoTDB Meetup on October 15, Bo Bi, a data engineer at Mafengwo, a leading Chinese travel-social e-commerce platform, introduced the basic principles of SeaTunnel and enterprise practice around it, the pain points and optimization thinking in typical scenarios of Mafengwo's big data development and scheduling platform, and shared his experience of participating in community contributions. We hope to help you understand SeaTunnel and the paths  [...]
+
+
+## Introduction to the technical principle of SeaTunnel
+SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of large volumes of data (offline and real-time).
+
+The diagram above shows the workflow of SeaTunnel, which in simple terms 
consists of 3 parts: input, transformation, and output; more complex data 
processing is just a combination of several actions.
+
+In a synchronization scenario, such as importing Kafka to Elasticsearch, Kafka 
is the Source of the process and Elasticsearch is the Sink of the process.
+
+If, during the import process, the field columns do not match the external 
data columns to be written and some column or type conversion is required, or 
if you need to join multiple data sources and then do some data widening, field 
expansion, etc., then you need to add some Transform in the process, 
corresponding to the middle part of the picture.
+
+![](/image/16714322747890/16714323322988.jpg)
+This shows that the core of SeaTunnel is the Source, Transform and Sink 
process definitions.
+
+In Source we can define the data sources we need to read; in Sink we can define the destination of the data pipeline and eventually write to external storage; and in between we can transform the data, either using SQL or custom functions.
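+
+To make this concrete, below is a minimal sketch of what such a Source, Transform, and Sink definition looks like in a SeaTunnel-style config file. The connector and option names here are illustrative assumptions; consult the documentation of your SeaTunnel version for the exact spelling.
+
+```
+# Hypothetical minimal pipeline: read fake data, reshape it with SQL, print it
+env {
+  execution.parallelism = 1            # number of parallel readers/writers
+}
+source {
+  FakeSource {
+    result_table_name = "fake"         # register the stream as a temporary table
+  }
+}
+transform {
+  sql {
+    sql = "SELECT name, age FROM fake" # column projection / type conversion
+  }
+}
+sink {
+  Console {}                           # write the result to stdout
+}
+```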
+
+## SeaTunnel Connector API Version V1 Architecture Breakdown
+For a mature component framework, there is usually something distinctive about the design of its API implementation that makes the framework extensible.
+
+The SeaTunnel architecture consists of three main parts:
+
+1. the SeaTunnel basic API;
+
+2. the implementation of the SeaTunnel basic API;
+
+3. SeaTunnel's plug-in system.
+
+## SeaTunnel Basic API
+![](/image/16714322747890/16714323668557.jpg)
+The above diagram shows the definition of the interface; the Plugin interface in SeaTunnel abstracts the various actions of data processing into Plugins.
+
+The five parts of the diagram below (BaseSource, BaseTransform, BaseSink, RuntimeEnv, and Execution) all inherit from the Plugin interface.
+![](/image/16714322747890/16714323741126.jpg)
+
+As process-definition plug-ins, Source is responsible for reading data, Transform for transforming it, Sink for writing it out, and RuntimeEnv for setting the base environment variables.
+
+The overall SeaTunnel base API is shown below.
+
+![](/image/16714322747890/16714323846302.jpg)
+Execution, the data flow builder used to build the entire data flow based on the first three, is also part of the base API.
+
+![](/image/16714322747890/16714323920717.jpg)
+
+## SeaTunnel Base API Implementation
+
+Based on the previous basic APIs, SeaTunnel provides separate implementation packages for different computing engines, currently the Spark API abstraction and the Flink API abstraction, which logically complete the process of building the data pipeline.
+
+![](/image/16714322747890/16714323741126.jpg)
+
+Due to space constraints, we will focus on Spark batch processing. In the wrapped implementation of the base API, BaseSparkSource implements BaseSource, BaseSparkTransform implements BaseTransform, and BaseSparkSink implements BaseSink.
+
+The method definitions use Spark's Dataset as the data carrier, and all data processing, including reading, processing, and exporting, is based on the Dataset.
+
+SparkEnvironment encapsulates Spark's SparkSession in an Env, making it easy for individual plugins to use.
+
+![](/image/16714322747890/16714324136843.jpg)
+
+The Spark batch process ends with SparkBatchExecution (the data stream builder), the core code used to build our data stream Pipeline; the most basic data stream is shown on the left in the diagram below.
+
+Each process component is defined by the user through the configuration of Source, Transform, and Sink. More complex data flow logic, such as multi-source joins and multi-pipeline processing, can also be built through Execution.
+
+![](/image/16714322747890/16714324237449.jpg)
+## SeaTunnel Connector V1 API Architecture Summary
+![](/image/16714322747890/16714324336701.jpg)
+SeaTunnel’s API consists of three main parts.
+
+The first part is the SeaTunnel base API, which provides the basic abstract 
interfaces such as Source, Sink, Transform, and Plugin.
+
+The second part is the wrapped implementation, on the Flink and Spark engines respectively, of the set of interfaces provided by the SeaTunnel base API (Transform, Sink, Source, Runtime, and Execution), i.e., the Spark engine API layer abstraction and the Flink engine API layer abstraction.
+
+Both the Flink and Spark engines support stream and batch processing, so there are different stream/batch entry points under each abstraction, such as FlinkStream and FlinkBatch under the Flink abstract API, and SparkBatch and SparkStreaming under the Spark abstract API.
+
+The third part is the plug-in system: based on the Spark and Flink API abstractions, SeaTunnel implements rich connectors and processing plug-ins, and developers can also extend the different engine API abstractions to implement their own Plugins.
+
+## SeaTunnel Implementation Principle
+Currently, SeaTunnel offers a variety of ways to use Flink, Spark, and 
FlinkSQL. Due to space limitations, we will introduce the execution principles 
of the Spark method.
+
+First, the entry point is the shell command start-seatunnel-spark.sh, which internally invokes the SparkStarter class; it parses the parameters passed by the shell script as well as the Config file, determining which Connectors (such as Fake or Console) are defined in the Config file.
+![](/image/16714322747890/16714324454477.jpg)
+Then it looks up the Connector paths in the Connector plugin directory and stitches them into the spark-submit launch command with --jars, so that the discovered Plugin jar packages are passed to the Spark cluster as dependencies.
+
+For Connector plugins, all Spark Connectors are packaged in the plugin 
directory of the distribution (this directory is managed centrally).
+
+After spark-submit is executed, the task is submitted to the Spark cluster, and the Main class of the Spark job's Driver builds the data flow Pipeline through the data flow builder Execution, combining Source, Sink, and Transform so that the whole chain is connected.
+
+## SeaTunnel Connector V2 API Architecture
+
+In the latest community release, SeaTunnel 2.2.0-beta, the refactoring of the Connector API, now known as the SeaTunnel V2 API, has been completed!
+
+Why do we need to refactor?
+
+Because the V1 Connector is strongly coupled to the engine, i.e., the Flink and Spark APIs, any upgrade of the Flink or Spark engine forces the Connector to be adjusted as well, possibly with changes to parameters or interfaces.
+
+This leads to multiple implementations for different engines and inconsistent parameters when developing a new Connector. The community therefore designed and implemented the V2 version of the API to address these pain points.
+
+![](/image/16714322747890/16714324726276.jpg)
+
+## SeaTunnel V2 API Architecture
+
+### 1. Table API
+
+* DataType: defines SeaTunnel's data structure, SeaTunnelRow, which is used to isolate the engine.
+
+* Catalog: used to obtain Table Schemas, Options, etc.
+
+* Catalog Storage: used to store user-defined Table Schemas for unstructured engines such as Kafka.
+
+* Table SPI: mainly used to expose the Source and Sink interfaces as an SPI.
+
+### 2. Source & Sink API
+
+Defines the Connector's core programming interfaces, used for implementing Connectors.
+
+### 3. Engine API
+
+* Translation: the translation layer, which translates the Source and Sink APIs implemented by the Connector into a runnable API inside the engine.
+
+* Execution: execution logic, used to define the execution logic of Source, Transform, Sink, and other operations within the engine.
+
+The Source & Sink API is the basis for the implementation of the connector and 
is very important for developers.
+
+The design of the V2 Source & Sink API is highlighted below.
+
+## SeaTunnel Connector V2 Source API
+The current version of SeaTunnel's API design draws on some of Flink's design concepts; the core classes of the Source API are shown below.
+
+![](/image/16714322747890/16714325444078.jpg)
+![](/image/16714322747890/16714325474972.jpg)
+The core Source API interaction flow is shown above. For concurrent reads, the enumerator SourceSplitEnumerator splits the task and sends each SourceSplit down to a SourceReader, which receives the split and uses it to read the external data source.
+
+To support resumable reads and exactly-once semantics, state must be preserved and restored: for example, each Reader's current Split consumption state is preserved via the Checkpoint state and Checkpoint mechanism, and restored after a failure, so that data can be read again from the point where it failed.
+
+## SeaTunnel Connector V2 Sink API
+![](/image/16714322747890/16714325600316.jpg)
+The overall Sink API interaction flow is shown in the diagram above. The SeaTunnel Sink is currently designed to support distributed transactions, based on a two-stage transaction commit.
+
+First, the SinkWriter continuously writes data to the external data source; then, when the engine performs a checkpoint, the first-stage commit is triggered.
+
+SinkWriter needs to do a Prepare commit, which is the first stage of the 
commit.
+
+The engine then determines whether all the Writers' first-stage commits succeeded; if so, it combines each Subtask's commit info and calls the Committer's commit method to actually commit the transaction against the database. This is the second stage of the commit.
+
+![](/image/16714322747890/16714325681738.jpg)
+For the Kafka sink connector implementation, the first stage is to do a 
pre-commit by calling KafkaProducerSender.prepareCommit().
+
+The second-stage commit is performed via Producer.commitTransaction().
+
+flush() then flushes the data from the Broker's system cache to disk.
+
+Finally, one point worth noting:
+
+Both SinkCommitter and SinkAggregatedCommitter can perform the second-stage commit, replacing the Committer in the diagram. The difference is that SinkCommitter can only commit a single Subtask's CommitInfo, which may partially succeed and partially fail, and cannot be handled globally.
+
+SinkAggregatedCommitter runs with a single parallelism, aggregating the CommitInfo of all Subtasks, and can perform the second-stage commit as a whole: either all succeed or all fail, avoiding the inconsistent state caused by partial failure of the second stage.
+
+It is therefore recommended that the SinkAggregatedCommitter be used in 
preference.
+
+## Comparison of SeaTunnel V1 and V2 API processing flows
+We can look at the changes between V1 and V2 from a data processing perspective, which is more intuitive, taking Spark batch processing as an example. SeaTunnel V1: the entire data processing flow is based on the Spark Dataset API, and the Connector and the compute engine are strongly coupled.
+
+![](/image/16714322747890/16714325887598.jpg)
+SeaTunnel V2: thanks to the translation layer, data read through the Connector API arrives as SeaTunnel's internal data structure, SeaTunnelRow, and during data transformation is translated into a runnable Spark API and a Spark Dataset that the engine recognizes internally.
+
+When data is written out, the Spark API and Spark Dataset are translated back through the translation layer into the Connector API and SeaTunnel's internal data structures, which the Sink connector can consume.
+
+> Overall, the addition of a translation layer between the API and the compute engine decouples the Connector API from the engine; the Connector implementation no longer depends on the compute engine, making extension and implementation more flexible.
+
+> In terms of community planning, the V2 API will be the main focus of 
development, and more features will be supported in V2, while V1 will be 
stabilized and no longer maintained.
+
+## Practice and reflections on our off-line development scheduling platform
+
+![](/image/16714322747890/16714326227360.jpg)
+The Mafengwo big data development platform focuses on providing one-stop big data development and scheduling services, helping business teams solve complex problems such as data development management, task scheduling, and task monitoring in offline scenarios.
+
+The offline development and scheduling platform connects the layers above and below it: upward, it provides open APIs and a UI to connect with various data application platforms and business teams; downward, it drives various compute and storage engines, running tasks in an orderly manner according to task dependencies and scheduled times.
+
+## Platform Capabilities
+**Data development**
+
+Task configuration, quality testing, release to production
+
+**Data synchronization**
+
+Data access, data processing, data distribution
+
+**Scheduling capabilities**
+
+Supports timed scheduling and triggered scheduling
+
+**Operations and Maintenance Centre**
+
+Job diagnosis, task O&M, instance O&M
+
+**Management**
+
+Library and table management, permission management, API management, script management
+
+In summary, the core capabilities of the offline development and scheduling platform are openness, versatility, and one-stop service: through standardized processes, the entire task development cycle is managed and a one-stop service experience is provided.
+
+## The architecture of the platform
+![](/image/16714322747890/16714326749427.jpg)
+The Mafengwo big data development and scheduling platform consists of four main modules: the task component layer, the scheduling layer, the service layer, and the monitoring layer.
+
+The service layer is mainly responsible for job lifecycle management (e.g. job creation, testing, release, offline), building and generating Airflow DAG Python files, task lineage and dependency management, permission management, and APIs (providing data-readiness signals and queries of task execution status).
+
+The scheduling layer is based on Airflow and is responsible for the scheduling 
of all offline tasks.
+
+The task component layer enables users to develop data pipelines through supported components, which include tools such as SparkSQL, HiveSQL, MR, and StarRocks import, directly interfacing with underlying HDFS, MySQL, and other storage systems.
+
+The monitoring layer is responsible for all aspects of monitoring and alerting 
on scheduling resources, computing resources, task execution, etc.
+
+## Open Data Sync Capability Scenarios
+Challenges with open capabilities: the platform needs to support multiple business scenarios and meet flexible data pipeline requirements (i.e., extend to support more task components such as hive2clickhouse, clickhouse2mysql, etc.).
+
+Extending task components based on Airflow: extensions carry high maintenance costs, and we need to reduce costs and increase efficiency (the providers Airflow offers are limited and not a good fit for our usage requirements; Airflow is a Python technology stack while our team mainly works on a Java stack, so the difference in technology stacks brings higher iteration costs).
+
+Self-developed task components: high platform-integration cost, long development cycles, and high configuration cost for task components (components researched or implemented in-house adapt their parameters in the service layer in different ways, with no uniform way of configuring parameters).
+
+We therefore wanted a data integration tool that supported a rich set of components, provided out-of-the-box capabilities, was easy to extend, and offered uniform parameter configuration and a uniform way of use, to ease platform integration and maintenance.
+
+## Selection of data integration tools
+![](/image/16714322747890/16714327002726.jpg)
+To address the pain points above, we actively explored solutions and conducted a selection analysis of several mainstream data integration products in the industry. As the comparison above shows, DataX and SeaTunnel both offer good scalability and high stability, support rich connector plugins, provide scripted, uniformly configurable usage, and have active communities.
+
+However, DataX is limited by its single-node (non-distributed) execution and is not well suited to massive data scenarios.
+
+In contrast, SeaTunnel offers distributed execution, distributed transactions, horizontally scalable data handling, and a unified technical solution for data synchronization scenarios.
+
+In addition to the advantages, features, and applicable scenarios described above, an important factor is that our offline big data compute resources are uniformly managed by YARN, and we want subsequently extended tasks to run on YARN as well, so we ultimately preferred SeaTunnel for our usage scenarios.
+
+At a later stage we may carry out further performance testing of SeaTunnel and develop the integration of SeaTunnel into the open data scheduling platform, rolling out its use gradually.
+
+## Outbound scenario: Hive data sync to StarRocks
+
+To briefly introduce the background: the big data platform has now completed the unification of the OLAP engine layer, using the StarRocks engine to replace the previous Kylin engine as the main query engine in OLAP scenarios.
+
+In the data processing flow, after the data is modelled in the data warehouse, the upper-layer models need to be imported into the OLAP engine for query acceleration, so every day there are a large number of tasks pushing data from Hive to StarRocks tables (based on a wrapper around the StarRocks Broker Load import method).
+
+The current pain points are twofold.
+
+* Long data synchronization links: the Hive2StarRocks processing link requires at least two tasks, which is relatively redundant.
+
+* Outbound efficiency: many Hive models are themselves produced by Spark SQL, and after that processing the in-memory Spark Dataset could be pushed directly to StarRocks without touching disk, shortening the time until the model is available.
+
+![](/image/16714322747890/16714327218590.jpg)
+StarRocks also supports Spark Load, a Spark-based bulk data import method, but our ETL is more complex and needs to support data conversion, multi-table joins, data aggregation, etc., so Spark Load cannot meet our needs for now.
+
+We learned from the SeaTunnel community that there are plans to support a StarRocks Sink Connector; we are working on that part as well, and will continue to communicate with the community and build it together.
+
+## How to get involved in community building
+### SeaTunnel Community Contribution
+As mentioned earlier, the community has completed the refactoring from the V1 to the V2 API and needs to implement more connector plug-ins based on the V2 connector API, which I was lucky enough to contribute to.
+
+I am currently responsible for big data infrastructure work, which touches many mainstream big data components, so when the community proposed connector issues, I was very interested in them.
+
+As our platform is also investigating SeaTunnel, contributing PRs to the community is a great way to learn about SeaTunnel.
+
+I remember that at first I picked a less difficult PR, implementing the WeChat sink connector, but in the process of contributing I ran into many problems: poor coding style, and a design that did not take into account extending to the rich output formats supported. Although the process was not smooth, I was really excited and felt a sense of accomplishment when the PR was merged.
+
+As I became more familiar with the process, I became much more efficient at submitting PRs and was confident enough to attempt more difficult issues.
+### How to get involved in community contributions quickly
+* Good first issue
+Good first issue examples: #3018 and #2828.
+
+If you are a first-time community contributor, it is advisable to start with a Good first issue, as these are relatively simple, newcomer-friendly issues.
+
+Through a Good first issue, you can get familiar with the whole process of contributing to an open source project on GitHub: first fork the project, then commit your changes, and finally submit a pull request and wait for the community review; reviewers will leave comments with suggestions for improvement directly on the PR, and once your PR is merged you will have completed a complete contribution.
+
+* Subscribe to the community mailing list
+Once you're familiar with the PR contribution process, you can subscribe to the community mailing list to keep up to date with what's happening in the community, such as which features are currently being worked on and what's planned for future iterations. If you're interested in a feature, you can contribute to it as your situation allows!
+* Familiarity with git usage
+The main git commands used in development are git clone, git pull, git rebase, and git merge. git rebase is recommended in the community development specification, as it does not generate extra merge commits compared to git merge.
+* Familiarity with the GitHub project collaboration process
+Open source projects are developed collaboratively by many people, and collaboration on GitHub is built around the fork model. For example, the Apache SeaTunnel project, which lives under the apache organization, is first forked into your own GitHub space.
+
+You then modify the implementation, open a pull request, and associate the pull request with the issue. If development takes a long time, the target branch may have accumulated many new commits by the time you push, in which case you need to pull and merge, or rebase.
+
+* Compiling the project from source
+It is important to be familiar with compiling the source, as a successful local build proves that the code added to the project compiles, and serves as a preliminary check before opening a PR. Source compilation is generally slow and can be sped up by using mvn -T for multi-threaded parallel compilation.
+* Compilation checks
+Pre-compilation checks, including the licence header, code checkstyle, and document checkstyle, are run during the Maven build; if they fail, CI will not pass. It is therefore recommended to use plug-in tools in the IDE to improve efficiency: for example, there is a Checkstyle plug-in that automatically checks the code style, and the licence header can be added via code templates in IDEA; the community has previously shared how to set these up.
+* Add full E2E
+
+Add full E2E testing and ensure that the E2E tests pass before opening the pull request.
+
+Finally, I hope more students will join the SeaTunnel community, where you can 
not only feel the open-source spirit and culture of Apache but also understand 
the management process of Apache projects and learn good code design ideas.
+
+We hope that by working together and growing together, we can build SeaTunnel 
into a top-notch data integration platform.
+
diff --git 
a/blog/2022-12-10-SeaTunnel-supports-IoTDB-to-implement-IoT-data-synchronization.md
 
b/blog/2022-12-10-SeaTunnel-supports-IoTDB-to-implement-IoT-data-synchronization.md
new file mode 100644
index 0000000000..5f37274b5f
--- /dev/null
+++ 
b/blog/2022-12-10-SeaTunnel-supports-IoTDB-to-implement-IoT-data-synchronization.md
@@ -0,0 +1,209 @@
+---
+slug: SeaTunnel-supports-IoTDB-to-implement-IoT-data-synchronization
+title: SeaTunnel supports IoTDB to implement IoT data synchronization
+tags: [Meetup]
+---
+# SeaTunnel supports IoTDB to implement IoT data synchronization
+![](/image/16714316310459/16714316482580.jpg)
+> Apache IoTDB (Internet of Things Database) is a software system that 
integrates the collection, storage, management, and analysis of time series 
data of the Internet of Things, which can meet the needs of massive data 
storage, high-speed data reading, and complex data analysis in the field of 
Industrial Internet of Things. Currently, SeaTunnel already supports IoTDB 
Connector, realizing the connection of data synchronization scenarios in the 
IoT field.
+
+> At the SeaTunnel community online meeting in October this year, SeaTunnel 
Committer Wang Hailin introduced the implementation process of SeaTunnel’s 
access to IoTDB, allowing users to have a deeper understanding of the operation 
method and principle of IoTDB data synchronization.
+
+The topic I’m sharing today is using SeaTunnel to play around with data 
synchronization in IoTDB.
+
+This session is divided into six subsections. First, we will go over the basic concepts of SeaTunnel; on that basis, we will focus on the functional features of the IoTDB Connector; then we will analyze the data read and write functions of the IoTDB Connector and walk through the implementation; and finally, we will show some typical usage scenarios and cases to understand how to u [...]
+
+## Introduction to SeaTunnel basic concepts
+This is the basic architecture of SeaTunnel, an engine built for data 
synchronization, with a set of abstract APIs for reading data from and writing 
to a variety of data sources.
+
+![](/image/16714316310459/16714316839299.jpg)
+The left-hand side briefly lists the Source abstractions: for example, we abstract the Source's API, Type, and State to read the data source, unifying the data types of the various data sources to the abstract types defined there, and retaining and recovering the state of the read position during reading.
+
+This is the abstraction for Source; we have done a similar abstraction for Sink, i.e., how data is written, how data types map to the real data source types, and how state is retained and restored.
+
+Based on these APIs, a translation layer translates them to the corresponding execution engine. SeaTunnel currently supports three execution engines: Spark, Flink, and our own execution engine, SeaTunnel Engine, which will be released soon.
+
+This is roughly what SeaTunnel does. SeaTunnel relies on Source and Sink to read and write data for data synchronization; we call them Connectors, and a Connector consists of a Source and a Sink.
+
+![](/image/16714316310459/16714316928812.jpg)
+From the diagram above we can see the different data sources: the Source is responsible for reading data from the various data sources and transforming it into SeaTunnelRow, which together with the type system forms the abstraction layer; the Sink is responsible for pulling data from the abstraction layer and writing it to the concrete data store in that store's format.
+
+The combination of Source + Abstraction Layer + Sink enables the 
synchronization of data between multiple heterogeneous data sources.
+
+I’ll use a simple example below to illustrate how SeaTunnel’s Source and Sink 
work.
+
+![](/image/16714316310459/16714317022389.jpg)
+
+![](/image/16714316310459/16714317067444.jpg)
+We can specify combinations of Sources and Sinks through the configuration file. The commands in the toolkit provided by SeaTunnel take the configuration file as an argument and, when executed, carry out the data movement.
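+
+As a rough illustration, a minimal configuration combining one Source and one Sink might look like the sketch below (the Fake source and Console sink follow SeaTunnel's documented examples, but treat the exact option names as assumptions that vary by version). The file is then passed to one of the toolkit commands, e.g. `./bin/start-seatunnel-spark.sh --config <path-to-this-file>`.
+
+```
+# Minimal sketch: one Source, one Sink, no Transform
+env {
+  execution.parallelism = 1      # single-threaded read and write
+}
+source {
+  FakeSource {
+    result_table_name = "fake"   # register the generated rows under a table name
+  }
+}
+sink {
+  Console {}                     # print every row to stdout
+}
+```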
+
+![](/image/16714316310459/16714317166018.jpg)
+![](/image/16714316310459/16714317203806.jpg)
+![](/image/16714316310459/16714317262218.jpg)
+This is the Connector ecosystem currently supported by SeaTunnel: data sources such as JDBC, HDFS, Hive, Pulsar, and various message queues are all supported.
+
+The list in the picture does not exhaust the Connectors supported by SeaTunnel. Under the GitHub SeaTunnel project you can see the plugins directory, where supported Connector plugins are constantly being added and where you can see the latest additions in real time.
+
+## IoTDB Connector Features
+Below is information about access to the IoTDB Connector.
+
+First, we introduce the functional features of the IoTDB Connector integrated into SeaTunnel, and what exactly it supports, for your reference.
+
+## Source Features
+![](/image/16714316310459/16714317512435.jpg)
+Firstly, there are the typical usage scenarios supported by Source, such as 
bulk reading of devices, field projection, data type mapping, parallel reading, 
etc.
+
+As you can see above, the IoTDB Source supports all the listed features except exactly-once semantics and stream mode. For batch reads, IoTDB has an align-by-device SQL syntax that lets you read data from multiple devices in a single batch. For data type projection, IoTDB SQL returns the time column by default when querying any metric, and align by device also returns the device column; both can be projected onto SeaTunnel columns.
+
+The only data type not supported is Vector; all others are supported.
+
+For parallel reading, IoTDB data is timestamped, and we use timestamp ranges to achieve parallel reads.
+
+State recovery can be based on each Split's position information, since we divide the time range being read into different splits.
+
+## Sink functional features
+![](/image/16714316310459/16714317679569.jpg)
+
+The diagram above shows the features already supported by the SeaTunnel Sink. Regarding metadata extraction, we support extracting metadata such as measurement and device from SeaTunnelRow, and extracting the event time from SeaTunnelRow or using the current processing time. Batch commits and exception retries are also supported.
+## IoTDB data reading analysis
+Next, we analyze the implementation and support for data reading.
+## Data type mapping
+The first is data type mapping: reading IoTDB data into SeaTunnel requires converting the IoTDB data types to SeaTunnel data types.
+![](/image/16714316310459/16714317930593.jpg)
+The BOOLEAN, INT32, INT64, etc. listed here all have corresponding SeaTunnel data types. INT32 can be mapped according to the read type configured on the SeaTunnel side: to TINYINT or SMALLINT when the range of values is small, or to INT.
+
+The Vector type is not currently supported.
+
+![](/image/16714316310459/16714318216373.jpg)
+This is the corresponding example code, showing where the type conversion is done.
+
+## Field projection
+
+The other aspect is field projection when reading: the Time field can be mapped automatically when reading IoTDB data, and we can choose to map it to a SeaTunnel type such as TIMESTAMP or BIGINT.
+
+![](/image/16714316310459/16714318381313.jpg)
+In SQL you can extract only the columns you need, and on the SeaTunnel side you can specify the name, type, etc. of each column after it is mapped, via the fields option. The final result of the data read into SeaTunnel is shown in the figure above.
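+
+As a sketch, the projection just described might be configured as follows (option names such as node_urls and the fields block are assumptions modeled on typical IoTDB source examples; verify them against the connector documentation):
+
+```
+source {
+  IoTDB {
+    node_urls = "localhost:6667"   # assumed IoTDB endpoint option
+    sql = "SELECT temperature, humidity FROM root.test_group.*"
+    fields {                       # how each column maps onto SeaTunnel
+      ts = bigint                  # the implicit time column, projected as BIGINT
+      temperature = float
+      humidity = float
+    }
+  }
+}
+```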
+
+![](/image/16714316310459/16714318550071.jpg)
+
+We just saw that the SQL does not select the time column, yet the actual result contains it: we support projecting the time column field, and it can be projected into different data types, which users can convert according to their needs. The diagram above shows the implementation logic.
+
+## Batch read Device
+This is a common requirement, as we often need to synchronize large batches of data with the same structure.
+
+![](/image/16714316310459/16714318796196.jpg)
+
+SeaTunnel supports the align-by-device syntax, so the device column can also be projected onto the SeaTunnelRow.
+
+![](/image/16714316310459/16714318873362.jpg)
+Assuming there is a table in IoTDB, we project the device column onto SeaTunnel by treating it as data via this syntax. After configuring the device-name column and specifying its data type, the data read into SeaTunnel ends up in the format shown above, containing Time, the device column, and the actual data values. This makes it possible to read data from multiple devices in bulk.
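+
+A hedged configuration sketch for this scenario (the device_name field name is an assumption; the key point is that the device column is declared alongside the data columns):
+
+```
+source {
+  IoTDB {
+    node_urls = "localhost:6667"
+    sql = "SELECT temperature, humidity FROM root.test_group.* ALIGN BY DEVICE"
+    fields {
+      ts = bigint
+      device_name = string   # the projected device column
+      temperature = float
+      humidity = float
+    }
+  }
+}
+```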
+
+## Parallel reading
+The other is a parallel read.
+
+* Split
+We scope the table by the Time column: when reading in parallel, we split the time range so that parallel threads/processes each read a specific range of data. By configuring three parameters, the original SQL is divided into different splits with query conditions appended, producing the actual read SQL (see the configuration sketch at the end of this section).
+
+* Allocate Splits to the readers
+Once splitting is done, an allocation logic distributes the splits to each parallel reader.
+
+![](/image/16714316310459/16714319281280.jpg)
+
+This logic assigns splits to readers based on the split ID; the distribution may be more random, or more uniform if the split IDs hash well, depending on the Connector.
+
+![](/image/16714316310459/16714319372730.jpg)
+
+The result achieved is shown in the picture.
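+
+For reference, the three split parameters mentioned in the Split step might be configured like this (the parameter names are assumptions modeled on typical partitioned-read options; the engine turns them into per-split WHERE conditions on the time column):
+
+```
+source {
+  IoTDB {
+    node_urls = "localhost:6667"
+    sql = "SELECT temperature, humidity FROM root.test_group.*"
+    # assumed names for the three split parameters over the time column
+    lower_bound = 1664035200000    # start of the time range (epoch ms)
+    upper_bound = 1664121600000    # end of the time range (epoch ms)
+    num_partitions = 4             # number of splits to generate
+  }
+}
+```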
+
+## Status recovery
+
+State recovery is also involved when reading: if the task is large, reading takes longer, and if an error or exception occurs midway, we have to consider how to recover the state from the point where the error occurred and continue reading afterward.
+
+![](/image/16714316310459/16714319281280.jpg)
+![](/image/16714316310459/16714319569097.jpg)
+![](/image/16714316310459/16714319599521.jpg)
+
+SeaTunnel's state recovery works mainly by having the reader store unread Split information in its state; the engine periodically snapshots this state during reading, so that on recovery the last snapshot can be restored and reading can continue from there.
+
+## IoTDB Connector Data Write Analysis
+The next step is the parsing of the data writes.
+
+## Data type mapping
+![](/image/16714316310459/16714319862080.jpg)
+
+Data writing also involves data type mapping, but here, in contrast to reading, SeaTunnel data types are mapped to IoTDB data types. As IoTDB only has INT32, writing promotes the TINYINT and SMALLINT types to INT32. All other data types convert one-to-one; the ARRAY and VECTOR data types are not yet supported.
+
+![](/image/16714316310459/16714319949478.jpg)
+
+The above diagram shows the corresponding code, the implementation logic will 
need to be seen in our specific mapping.
+
+## Dynamic injection of metadata
+SeaTunnel supports the dynamic injection of metadata.
+
+When heterogeneous data sources are written to IoTDB, the device, measurement, and time are extracted from each row of data when serializing the SeaTunnelRow, based on the configured columns, or a fixed configured value can be used. If no time column is specified, the current system time is populated, and a storage group can be configured to be automatically prepended to the device name.
+
+![](/image/16714316310459/16714320117277.jpg)
+
+For example, suppose a row read into SeaTunnel has the data format shown above; it can be configured to synchronize to IoTDB, and the result obtained is as follows.
+
+![](/image/16714316310459/16714320201848.jpg)
+
+The temperature and humidity columns we need were extracted, and ts and device 
names were extracted as the original data for the IoTDB.
+
+## Batch commits and exception retries
+
+In addition, the Sink needs to handle batching and retries when writing. For batching, we can configure the number of entries and the interval between batch commits; if the data is cached in memory, a separate thread can be enabled for timed commits.
+
+For retries, SeaTunnel supports configuring the number of retries, the waiting interval between retries, and the maximum backoff, and a retry is abandoned when a non-recoverable error is encountered.
+
+![](/image/16714316310459/16714320394193.jpg)
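+
+A hedged sketch of such batching and retry settings on the Sink side (option names like batch_size are assumptions for illustration; check the IoTDB sink documentation for the real keys):
+
+```
+sink {
+  IoTDB {
+    node_urls = "localhost:6667"
+    batch_size = 1024                   # commit after this many records
+    batch_interval_ms = 1000            # or after this much time
+    max_retries = 3                     # retry failed writes up to three times
+    retry_backoff_multiplier_ms = 500   # wait between retries
+  }
+}
+```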
+
+## IoTDB Connector Usage Examples
+
+After the previous analysis of reading and writing data, let’s look at three 
typical examples of usage scenarios.
+
+## Exporting data from IoTDB
+The first scenario is exporting data from IoTDB; the example given here reads data from IoTDB to the Console.
+
+* Read in parallel, output to Console
+
+Parallelism: 2
+
+Number of batches: 24
+
+Time frame: 2022-09-25 ~ 2022-09-26
+![](/image/16714316310459/16714320856052.jpg)
+
+Assume we have a data table in IoTDB and want to export it to the Console. The whole configuration is shown above; we need to map the columns of data we want to export and the time range to query.
+
+This is the simplest example, but in practice, the Sink side may be more 
complex, so you will need to refer to the documentation of the corresponding 
data source for the appropriate configuration.
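+
+Putting the pieces together, a configuration for this export scenario might look like the sketch below (the same assumed option names as earlier; the parallelism, split count, and time window mirror the bullet points above, with epoch-millisecond bounds assumed for the dates):
+
+```
+env {
+  execution.parallelism = 2        # two parallel readers
+}
+source {
+  IoTDB {
+    node_urls = "localhost:6667"
+    sql = "SELECT temperature, humidity FROM root.test_group.*"
+    lower_bound = 1664064000000    # 2022-09-25 00:00:00 UTC, in epoch ms
+    upper_bound = 1664150400000    # 2022-09-26 00:00:00 UTC, in epoch ms
+    num_partitions = 24            # 24 splits, matching the batch count above
+    fields {
+      ts = bigint
+      temperature = float
+      humidity = float
+    }
+  }
+}
+sink {
+  Console {}
+}
+```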
+
+## Importing data to IoTDB
+
+* Read a database, batch write to IoTDB
+    * Batch writing: one commit every 1024 entries or every 1000 ms
+
+    * Extract the metadata device, timestamp, and measurement
+
+    * Specify the storage group: root.test_group
+
+![](/image/16714316310459/16714321290339.jpg)
+Another typical usage scenario is importing data from other data sources into IoTDB. Suppose we have an external database table with columns such as ts, temperature, and humidity, and we import it into IoTDB, needing the temperature and humidity columns but not the rest. The whole configuration is shown in the diagram above for reference.
+
+On the Sink side, you mainly specify the keys for the metadata: which column the device is extracted from, which column the time is extracted from, which columns to write to IoTDB, and so on.
+
+As you can see, we can also specify the IoTDB storage group via the storage group option.
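+
+A hedged sketch of the Sink side of this import (the key_device, key_timestamp, key_measurement_fields, and storage_group options follow the metadata-extraction description above; treat the exact names as assumptions):
+
+```
+sink {
+  IoTDB {
+    node_urls = "localhost:6667"
+    key_device = "device_name"                            # column holding the device id
+    key_timestamp = "ts"                                  # column holding the event time
+    key_measurement_fields = ["temperature", "humidity"]  # columns written as measurements
+    storage_group = "root.test_group"                     # prepended to the device name
+    batch_size = 1024
+    batch_interval_ms = 1000
+  }
+}
+```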
+
+## Synchronizing data between IoTDB
+The third scenario is synchronizing data from one IoTDB to another, writing to IoTDB in bulk. Suppose a table in one IoTDB needs to be synchronized to another IoTDB, and after synchronization the storage group changes and the metric names of the data columns change as well; you can then use projection to rewrite the metric names and use SQL to rewrite the storage group.
+
+![](/image/16714316310459/16714321480992.jpg)
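+
+A sketch of such an IoTDB-to-IoTDB job, renaming the metric via the projected field name and rewriting the storage group on the Sink (whether renaming happens in the fields block is an assumption based on the projection description above; the same caveat on option names applies):
+
+```
+source {
+  IoTDB {
+    node_urls = "source-iotdb:6667"
+    sql = "SELECT temperature FROM root.source_group.* ALIGN BY DEVICE"
+    fields {
+      ts = bigint
+      device_name = string
+      temperature_c = float              # metric renamed during projection
+    }
+  }
+}
+sink {
+  IoTDB {
+    node_urls = "target-iotdb:6667"
+    key_device = "device_name"
+    key_timestamp = "ts"
+    key_measurement_fields = ["temperature_c"]
+    storage_group = "root.target_group"  # rewritten storage group
+  }
+}
+```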
+
+## How to get involved in contribution
+Finally, a few words about the next steps for the IoTDB Connector and how you 
can get involved in improving the Connector and contributing new features that 
are needed.
+
+## Next steps for the IoTDB Connector
+
+* Support for reading and writing vector data types
+* Support for tsfile reads and writes
+* Support for writing tsfile and reloading to IoTDB
\ No newline at end of file
diff --git 
a/blog/2022-12-9-SeaTunnel-engine-designed-for-tens-of-billions-data-integration.md
 
b/blog/2022-12-9-SeaTunnel-engine-designed-for-tens-of-billions-data-integration.md
new file mode 100644
index 0000000000..6ab6121fae
--- /dev/null
+++ 
b/blog/2022-12-9-SeaTunnel-engine-designed-for-tens-of-billions-data-integration.md
@@ -0,0 +1,168 @@
+---
+slug: SeaTunnel-engine-designed-for-tens-of-billions-data-integration
+title: SeaTunnel engine, designed for tens-of-billions data integration
+tags: [Meetup]
+---
+# SeaTunnel engine, designed for tens-of-billions data integration
+![](/image/16714309762810/16714309876928.jpg)
+Apache SeaTunnel Committer | Zongwen Li
+
+## Introduction to Apache SeaTunnel
+Apache SeaTunnel is a very easy-to-use ultra-high-performance distributed data 
integration platform that supports real-time synchronization of massive data.
+
+Apache SeaTunnel will try its best to solve the problems that may be 
encountered in the process of mass data synchronization, such as data loss and 
duplication, task accumulation and delay, low throughput, etc.
+
+## Milestones of SeaTunnel
+SeaTunnel, formerly known as Waterdrop, was open-sourced on GitHub in 2017.
+
+In October 2021, the Waterdrop community joined the Apache incubator and 
changed its name to SeaTunnel.
+
+## SeaTunnel Growth
+
+![](/image/16714309762810/16714310892722.jpg)
+![](/image/16714309762810/16714310916195.jpg)
+![](/image/16714309762810/16714310939883.jpg)
+When SeaTunnel entered the Apache incubator, the SeaTunnel community ushered 
in rapid growth.
+
+As of now, the SeaTunnel community has a total of 151 contributors, 4314 
Stars, and 804 forks.
+
+## Pain points of Existing engines
+There are many pain points faced by the existing computing engines in the 
field of data integration, and we will talk about this first. The pain points 
usually lie in three directions:
+
+* The fault tolerance ability of the engine;
+* Difficulty in configuration, operation, and maintenance of engine jobs;
+* The resource usage of the engine.
+
+## Fault tolerance
+Global Failover
+![Global-failover](/image/16714309762810/16714311670656.jpg)
+For distributed streaming processing systems, high throughput and low latency 
are often the most important requirements. At the same time, fault tolerance is 
also very important in distributed systems. For scenarios that require high 
correctness, the implementation of exactly once is often very important.
+
+In a distributed streaming processing system, since the computing power, 
network, load, etc. of each node are different, the state of each node cannot 
be directly merged to obtain a true global state. To obtain consistent results, 
the distributed processing system needs to be resilient to node failure, that 
is, it can recover to consistent results when it fails.
+
+Although Spark's official blog claims that Structured Streaming uses the Chandy-Lamport algorithm for failover handling, it does not disclose further details.
+
+Flink implemented Checkpoint as a fault-tolerance mechanism based on the above algorithm and published a related paper: Lightweight Asynchronous Snapshots for Distributed Dataflows.
+
+In the current industrial implementation, when a job fails, all nodes of the 
job DAG need to failover, and the whole process will last for a long time, 
which will cause a lot of upstream data to accumulate.
+
+## Loss of Data
+![](/image/16714309762810/16714312426416.jpg)
+The previous problem causes long recovery times, and the business service may only accept a certain degree of data delay.
+
+In a worse case, a single sink node cannot be recovered for a long time, and 
the source data has a limited storage time, such as MySQL and Oracle log data, 
which will lead to data loss.
+
+## Configuration is cumbersome
+Single table Configuration
+
+![](/image/16714309762810/16714312637015.jpg)
+The previous examples concern a small number of tables, but in real business development we usually need to synchronize thousands of tables, which may also be sharded across databases and tables;
+
+The status quo is that each table must be configured individually; synchronizing a large number of tables costs users a great deal of time, and is prone to problems such as field-mapping errors that are difficult to maintain.
+
+## Not supporting Schema Evolution
+
+![Not-supports-DDL](/image/16714309762810/16714312769761.jpg)
+Besides, according to a Fivetran research report, 60% of companies' schemas change every month, and 30% change every week.
+
+However, none of the existing engines supports Schema Evolution. After 
changing the Schema each time, the user needs to reconfigure the entire link, 
which makes the maintenance of the job very cumbersome.
+
+## The high volume of resource usage
+
+Database connections take up too many resources
+
+![](/image/16714309762810/16714313100541.jpg)
+If our Source or Sink is of JDBC type, then since the existing engines require one or more connections per table, synchronizing many tables occupies many connection resources and places a great burden on the database server.
+## Operator pressure is uncontrollable
+
+![](/image/16714309762810/16714313301435.jpg)
+In existing engines, buffers and other control operators are used to control pressure, i.e., the back-pressure mechanism. Since back pressure is propagated level by level, the pressure signal is delayed, and data processing is not smooth enough, increasing GC time, fault-tolerance completion time, etc.
+
+Another case is when neither the source nor the sink has reached maximum pressure, but the user still needs to limit the synchronization rate to avoid too much impact on the source or target database; this cannot be achieved through the back-pressure mechanism.
+
+## Architecture goals of Apache SeaTunnel Engine
+To solve these severe issues faced by computing engines, we developed our own engine specialized for big data integration.
+
+First, let's go through the goals this engine aims to achieve.
+
+## Pipeline Failover
+
+![](/image/16714309762810/16714313559400.jpg)
+In data integration, a single job may synchronize hundreds of tables, and the failure of one node or one table would cause all tables to fail over, which is too costly.
+
+We expect unrelated tasks within a job not to affect each other during fault tolerance, so we call a collection of vertices with upstream-downstream relationships a Pipeline; a Job can consist of one or more Pipelines.
+
+## Regional Failover
+Now, if there is an exception in a pipeline, we still need to fail over all the vertices in that pipeline; but can we restore only some of them?
+![](/image/16714309762810/16714313919617.jpg)
+For example, if the Source fails, the Sink does not need to restart. In the 
case of a single Source and multiple Sinks, if a single Sink fails, only the 
Sink and Source that failed will be restored; that is, only the node that 
failed and its upstream nodes will be restored.
+
+Obviously, stateless vertices do not need to be restarted, and since SeaTunnel is a data integration framework, we do not have aggregation-state vertices such as Agg or Count, so we only need to consider the Sink:
+
+* The Sink supports neither idempotence nor 2PC: restarting or not restarting results in the same data duplication, so this can only be mitigated by not restarting the Sink;
+* The Sink supports idempotence but not 2PC: since writes are idempotent, it does not matter whether the Source reads inconsistent data each time, and the Sink does not need to be restarted;
+* The Sink supports 2PC:
+    * If the Source supports data consistency and abort is not executed, the already-processed old data can be automatically ignored via the channel data ID, though the transaction session may time out;
+    * If the Source does not support data consistency, executing abort on the Sink discards the last batch of data, which has the same effect as restarting but does not require initialization such as re-establishing connections;
+    * That is, the simplest implementation is to execute abort.
+
+We use the pipeline as the minimum granularity for fault-tolerance management, and use the Chandy-Lamport algorithm to implement fault tolerance for distributed jobs.
+
+## Data Cache
+![](/image/16714309762810/16714314318184.jpg)
+For Sink failure, when data cannot be written, a possible solution is to run two jobs at the same time.
+
+One job reads the database logs using the CDC source connector and then writes 
the data to Kafka using the Kafka Sink connector. Another job reads data from 
Kafka using the Kafka source connector and writes data to the destination using 
the destination sink connector.
+
+This solution requires users to have a deep understanding of the underlying technology, and maintaining two jobs increases the difficulty of operations; and because every job needs a JobMaster, it requires more resources.
+
+Ideally, the user only knows that they will be reading data from the source 
and writing data to the sink, and at the same time, during this process, the 
data can be cached in case the sink fails. The sync engine needs to 
automatically add caching operations to the execution plan and ensure that the 
source still works in the event of a sink failure. In this process, the engine 
needs to ensure that the data written to the cache and read from the cache are 
transactional, to ensure data cons [...]
+
+## Sharding & Multi-table Sync
+![](/image/16714309762810/16714314489916.jpg)
+
+For synchronizing a large number of tables, we expect a single Source to support reading multiple tables with different structures, and then use side-stream output to stay consistent with the single-table streams.
+
+The advantage is that this reduces the number of connections held on the data source and improves the utilization of thread resources.
+
+At the same time, SeaTunnel Engine treats these tables as one Pipeline, which coarsens the granularity of fault tolerance; this is a trade-off, and users can choose how many tables flow through one Pipeline. A sketch of the table-routing idea follows.
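+
+The sketch below uses hypothetical record and router types, not SeaTunnel's actual connector API:
+
+```java
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.function.Consumer;
+
+// One physical source feed is split into per-table side outputs, so
+// downstream operators still see ordinary single-table streams.
+record TableRecord(String tableId, Object[] fields) {}
+
+final class TableRouter {
+    private final Map<String, Consumer<TableRecord>> outputs = new ConcurrentHashMap<>();
+
+    /** Register the downstream consumer for one table's side output. */
+    void register(String tableId, Consumer<TableRecord> downstream) {
+        outputs.put(tableId, downstream);
+    }
+
+    /** Route each record read over the shared connection to its table's stream. */
+    void route(TableRecord record) {
+        Consumer<TableRecord> out = outputs.get(record.tableId());
+        if (out != null) {
+            out.accept(record);
+        }
+    }
+}
+```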
+
+## Schema Evolution
+![](/image/16714309762810/16714314658701.jpg)
+Schema Evolution is a feature that allows users to easily change the current 
schema of a table to accommodate changing data over time. Most commonly, it is 
used when performing an append or overwrite operation, to automatically adjust 
the schema to include one or more new columns.
+
+This feature is required for real-time data warehouse scenarios. Currently, 
the Flink and Spark engines do not support this feature.
+
+In SeaTunnel Engine, we will use the Chandy-Lamport algorithm to propagate DDL events: they flow through the DAG, updating the schema held by each operator, and are finally synchronized to the Sink, as sketched below.
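+
+A minimal sketch of how a DDL event might travel through operators, with hypothetical event and operator types:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// A DDL event flows through the DAG like a Chandy-Lamport barrier: each
+// operator applies the change to its own schema, then forwards the event.
+record AddColumnEvent(String table, String column, String type) {}
+
+class SchemaAwareOperator {
+    private final List<String> columns = new ArrayList<>(List.of("id", "name"));
+    private final SchemaAwareOperator downstream; // null if this is the Sink
+
+    SchemaAwareOperator(SchemaAwareOperator downstream) {
+        this.downstream = downstream;
+    }
+
+    void onDdlEvent(AddColumnEvent event) {
+        columns.add(event.column());          // evolve this operator's schema
+        if (downstream != null) {
+            downstream.onDdlEvent(event);     // propagate toward the Sink
+        }
+    }
+}
+```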
+
+## Shared Resource
+![Shared-resource](/image/16714309762810/16714314806989.jpg)
+The multi-table feature reduces the connection resources consumed by Sources and Sinks. At the same time, we have implemented dynamic thread resource sharing in SeaTunnel Engine, which reduces the engine's resource usage on the server; the idea is sketched below.
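+
+Roughly: instead of pinning one OS thread per task, many task slices share a bounded pool. A toy sketch, not the engine's actual scheduler:
+
+```java
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+
+// Many mostly-idle synchronization tasks share a small pool instead of
+// each owning a dedicated thread, cutting per-task thread overhead.
+final class SharedTaskExecutor {
+    private final ExecutorService pool = Executors.newWorkStealingPool();
+
+    /** Submit one short slice of a task's work; slices interleave on shared threads. */
+    void submitSlice(Runnable slice) {
+        pool.execute(slice);
+    }
+}
+```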
+
+## Speed Control
+![](/image/16714309762810/16714315001348.jpg)
+As for the problems that back pressure alone cannot solve, we will optimize the Buffer and Checkpoint mechanisms:
+
+* First, we let the Buffer limit the amount of data that may pass through in a given period;
+* Second, through the Checkpoint mechanism, the engine can lock the buffer once checkpoints have reached the maximum parallelism and the checkpoint interval has elapsed, prohibiting the Source from writing more data; this relieves pressure proactively and avoids problems such as back-pressure signals being delayed or never reaching the Source (see the sketch after this list).
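+
+A toy sketch of the proactive throttle, with hypothetical names (the permit model is an assumption, not the engine's actual mechanism):
+
+```java
+import java.util.concurrent.Semaphore;
+
+// The Source must take a permit per record, bounding throughput per period;
+// the checkpoint coordinator can additionally lock the buffer so the Source
+// stops writing immediately instead of waiting for back pressure.
+final class ThrottledBuffer {
+    private final Semaphore permits;
+    private volatile boolean locked = false;
+
+    ThrottledBuffer(int maxRecordsPerPeriod) {
+        this.permits = new Semaphore(maxRecordsPerPeriod);
+    }
+
+    /** Called by the Source before emitting a record. */
+    void acquireWriteSlot() throws InterruptedException {
+        while (locked) {
+            Thread.onSpinWait();   // proactively held back during the lock
+        }
+        permits.acquire();
+    }
+
+    /** Called by the checkpoint coordinator around checkpoint alignment. */
+    void lock()   { locked = true; }
+    void unlock() { locked = false; }
+
+    /** Called periodically to refill the per-period budget. */
+    void refill(int n) { permits.release(n); }
+}
+```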
+
+The above are the design goals of SeaTunnel Engine; we hope they help you solve the problems that trouble you in data integration. Going forward, we will keep improving the SeaTunnel experience so that more people are willing to use it.
+
+## The future of Apache SeaTunnel
+As an Apache incubator project, the Apache SeaTunnel community is developing rapidly. In our community planning, we will focus on four directions:
+
+### 1. Support more data integration scenarios (Apache SeaTunnel Engine)
+This addresses pain points that existing engines cannot solve, such as whole-database synchronization, synchronization of table structure changes, and the overly coarse granularity of task failure.
+> If you are interested in the engine, you can follow this umbrella issue: https://github.com/apache/incubator-seatunnel/issues/2272
+
+### 2. Expand and improve the Connector & Catalog ecosystem
+Support more Connectors & Catalogs, such as TiDB, Doris, Stripe, etc., and improve existing connectors' usability and performance; support CDC connectors for real-time incremental synchronization scenarios.
+> If you are interested in connectors, you can follow this umbrella issue: https://github.com/apache/incubator-seatunnel/issues/1946
+
+### 3. Support more engine versions
+Such as Spark 3.x, Flink 1.14.x, etc.
+> If you are interested in supporting Spark 3.3, you can follow this PR: https://github.com/apache/incubator-seatunnel/pull/2574
+
+### 4. Make SeaTunnel easier to use (Apache SeaTunnel Web)
+Provide a web interface that makes operations more efficient through DAG/SQL editing, displays Catalogs, Connectors, Jobs, etc. simply and intuitively, and integrates with scheduling platforms to make task management easier.
+> If you are interested in the Web UI, you can follow our Web sub-project: https://github.com/apache/incubator-seatunnel-web
\ No newline at end of file
diff --git a/static/image/16714309762810/16714309876928.jpg 
b/static/image/16714309762810/16714309876928.jpg
new file mode 100644
index 0000000000..212484d790
Binary files /dev/null and b/static/image/16714309762810/16714309876928.jpg 
differ
diff --git a/static/image/16714309762810/16714310892722.jpg 
b/static/image/16714309762810/16714310892722.jpg
new file mode 100644
index 0000000000..ac723b3156
Binary files /dev/null and b/static/image/16714309762810/16714310892722.jpg 
differ
diff --git a/static/image/16714309762810/16714310916195.jpg 
b/static/image/16714309762810/16714310916195.jpg
new file mode 100644
index 0000000000..d5162a0b09
Binary files /dev/null and b/static/image/16714309762810/16714310916195.jpg 
differ
diff --git a/static/image/16714309762810/16714310939883.jpg 
b/static/image/16714309762810/16714310939883.jpg
new file mode 100644
index 0000000000..74e3f37a27
Binary files /dev/null and b/static/image/16714309762810/16714310939883.jpg 
differ
diff --git a/static/image/16714309762810/16714311670656.jpg 
b/static/image/16714309762810/16714311670656.jpg
new file mode 100644
index 0000000000..9a8796d9ff
Binary files /dev/null and b/static/image/16714309762810/16714311670656.jpg 
differ
diff --git a/static/image/16714309762810/16714312426416.jpg 
b/static/image/16714309762810/16714312426416.jpg
new file mode 100644
index 0000000000..d6bdc20a10
Binary files /dev/null and b/static/image/16714309762810/16714312426416.jpg 
differ
diff --git a/static/image/16714309762810/16714312637015.jpg 
b/static/image/16714309762810/16714312637015.jpg
new file mode 100644
index 0000000000..9a17811251
Binary files /dev/null and b/static/image/16714309762810/16714312637015.jpg 
differ
diff --git a/static/image/16714309762810/16714312769761.jpg 
b/static/image/16714309762810/16714312769761.jpg
new file mode 100644
index 0000000000..f7d6503796
Binary files /dev/null and b/static/image/16714309762810/16714312769761.jpg 
differ
diff --git a/static/image/16714309762810/16714313100541.jpg 
b/static/image/16714309762810/16714313100541.jpg
new file mode 100644
index 0000000000..fc0e534baf
Binary files /dev/null and b/static/image/16714309762810/16714313100541.jpg 
differ
diff --git a/static/image/16714309762810/16714313301435.jpg 
b/static/image/16714309762810/16714313301435.jpg
new file mode 100644
index 0000000000..3b65e55ca1
Binary files /dev/null and b/static/image/16714309762810/16714313301435.jpg 
differ
diff --git a/static/image/16714309762810/16714313559400.jpg 
b/static/image/16714309762810/16714313559400.jpg
new file mode 100644
index 0000000000..0818c16509
Binary files /dev/null and b/static/image/16714309762810/16714313559400.jpg 
differ
diff --git a/static/image/16714309762810/16714313919617.jpg 
b/static/image/16714309762810/16714313919617.jpg
new file mode 100644
index 0000000000..82c01db755
Binary files /dev/null and b/static/image/16714309762810/16714313919617.jpg 
differ
diff --git a/static/image/16714309762810/16714314318184.jpg 
b/static/image/16714309762810/16714314318184.jpg
new file mode 100644
index 0000000000..6fe52e7b1e
Binary files /dev/null and b/static/image/16714309762810/16714314318184.jpg 
differ
diff --git a/static/image/16714309762810/16714314489916.jpg 
b/static/image/16714309762810/16714314489916.jpg
new file mode 100644
index 0000000000..4fa615dfa7
Binary files /dev/null and b/static/image/16714309762810/16714314489916.jpg 
differ
diff --git a/static/image/16714309762810/16714314658701.jpg 
b/static/image/16714309762810/16714314658701.jpg
new file mode 100644
index 0000000000..8e4b8203ea
Binary files /dev/null and b/static/image/16714309762810/16714314658701.jpg 
differ
diff --git a/static/image/16714309762810/16714314806989.jpg 
b/static/image/16714309762810/16714314806989.jpg
new file mode 100644
index 0000000000..7bc85d3951
Binary files /dev/null and b/static/image/16714309762810/16714314806989.jpg 
differ
diff --git a/static/image/16714309762810/16714315001348.jpg 
b/static/image/16714309762810/16714315001348.jpg
new file mode 100644
index 0000000000..c44079b03d
Binary files /dev/null and b/static/image/16714309762810/16714315001348.jpg 
differ
diff --git a/static/image/16714316310459/16714316482580.jpg 
b/static/image/16714316310459/16714316482580.jpg
new file mode 100644
index 0000000000..33cccae194
Binary files /dev/null and b/static/image/16714316310459/16714316482580.jpg 
differ
diff --git a/static/image/16714316310459/16714316839299.jpg 
b/static/image/16714316310459/16714316839299.jpg
new file mode 100644
index 0000000000..a85d7b984e
Binary files /dev/null and b/static/image/16714316310459/16714316839299.jpg 
differ
diff --git a/static/image/16714316310459/16714316928812.jpg 
b/static/image/16714316310459/16714316928812.jpg
new file mode 100644
index 0000000000..c07a1adbaf
Binary files /dev/null and b/static/image/16714316310459/16714316928812.jpg 
differ
diff --git a/static/image/16714316310459/16714317022389.jpg 
b/static/image/16714316310459/16714317022389.jpg
new file mode 100644
index 0000000000..e0c868405e
Binary files /dev/null and b/static/image/16714316310459/16714317022389.jpg 
differ
diff --git a/static/image/16714316310459/16714317067444.jpg 
b/static/image/16714316310459/16714317067444.jpg
new file mode 100644
index 0000000000..62d7c00e22
Binary files /dev/null and b/static/image/16714316310459/16714317067444.jpg 
differ
diff --git a/static/image/16714316310459/16714317166018.jpg 
b/static/image/16714316310459/16714317166018.jpg
new file mode 100644
index 0000000000..d354109093
Binary files /dev/null and b/static/image/16714316310459/16714317166018.jpg 
differ
diff --git a/static/image/16714316310459/16714317203806.jpg 
b/static/image/16714316310459/16714317203806.jpg
new file mode 100644
index 0000000000..b7da6ca44e
Binary files /dev/null and b/static/image/16714316310459/16714317203806.jpg 
differ
diff --git a/static/image/16714316310459/16714317262218.jpg 
b/static/image/16714316310459/16714317262218.jpg
new file mode 100644
index 0000000000..2132fc3dc1
Binary files /dev/null and b/static/image/16714316310459/16714317262218.jpg 
differ
diff --git a/static/image/16714316310459/16714317512435.jpg 
b/static/image/16714316310459/16714317512435.jpg
new file mode 100644
index 0000000000..69fc64480b
Binary files /dev/null and b/static/image/16714316310459/16714317512435.jpg 
differ
diff --git a/static/image/16714316310459/16714317679569.jpg 
b/static/image/16714316310459/16714317679569.jpg
new file mode 100644
index 0000000000..85a9618442
Binary files /dev/null and b/static/image/16714316310459/16714317679569.jpg 
differ
diff --git a/static/image/16714316310459/16714317930593.jpg 
b/static/image/16714316310459/16714317930593.jpg
new file mode 100644
index 0000000000..6bb719ba91
Binary files /dev/null and b/static/image/16714316310459/16714317930593.jpg 
differ
diff --git a/static/image/16714316310459/16714318216373.jpg 
b/static/image/16714316310459/16714318216373.jpg
new file mode 100644
index 0000000000..ad6c5c06d3
Binary files /dev/null and b/static/image/16714316310459/16714318216373.jpg 
differ
diff --git a/static/image/16714316310459/16714318381313.jpg 
b/static/image/16714316310459/16714318381313.jpg
new file mode 100644
index 0000000000..d865eed3b4
Binary files /dev/null and b/static/image/16714316310459/16714318381313.jpg 
differ
diff --git a/static/image/16714316310459/16714318550071.jpg 
b/static/image/16714316310459/16714318550071.jpg
new file mode 100644
index 0000000000..3e3a28f4fd
Binary files /dev/null and b/static/image/16714316310459/16714318550071.jpg 
differ
diff --git a/static/image/16714316310459/16714318796196.jpg 
b/static/image/16714316310459/16714318796196.jpg
new file mode 100644
index 0000000000..29d1ad3b35
Binary files /dev/null and b/static/image/16714316310459/16714318796196.jpg 
differ
diff --git a/static/image/16714316310459/16714318873362.jpg 
b/static/image/16714316310459/16714318873362.jpg
new file mode 100644
index 0000000000..f1b669c3e0
Binary files /dev/null and b/static/image/16714316310459/16714318873362.jpg 
differ
diff --git a/static/image/16714316310459/16714319281280.jpg 
b/static/image/16714316310459/16714319281280.jpg
new file mode 100644
index 0000000000..050fcfd029
Binary files /dev/null and b/static/image/16714316310459/16714319281280.jpg 
differ
diff --git a/static/image/16714316310459/16714319372730.jpg 
b/static/image/16714316310459/16714319372730.jpg
new file mode 100644
index 0000000000..0d1ea5d8a4
Binary files /dev/null and b/static/image/16714316310459/16714319372730.jpg 
differ
diff --git a/static/image/16714316310459/16714319569097.jpg 
b/static/image/16714316310459/16714319569097.jpg
new file mode 100644
index 0000000000..a53e0c3fa5
Binary files /dev/null and b/static/image/16714316310459/16714319569097.jpg 
differ
diff --git a/static/image/16714316310459/16714319599521.jpg 
b/static/image/16714316310459/16714319599521.jpg
new file mode 100644
index 0000000000..5131b3cab1
Binary files /dev/null and b/static/image/16714316310459/16714319599521.jpg 
differ
diff --git a/static/image/16714316310459/16714319862080.jpg 
b/static/image/16714316310459/16714319862080.jpg
new file mode 100644
index 0000000000..6e0d3d7048
Binary files /dev/null and b/static/image/16714316310459/16714319862080.jpg 
differ
diff --git a/static/image/16714316310459/16714319949478.jpg 
b/static/image/16714316310459/16714319949478.jpg
new file mode 100644
index 0000000000..cef4434bf2
Binary files /dev/null and b/static/image/16714316310459/16714319949478.jpg 
differ
diff --git a/static/image/16714316310459/16714320117277.jpg 
b/static/image/16714316310459/16714320117277.jpg
new file mode 100644
index 0000000000..260cf25c01
Binary files /dev/null and b/static/image/16714316310459/16714320117277.jpg 
differ
diff --git a/static/image/16714316310459/16714320201848.jpg 
b/static/image/16714316310459/16714320201848.jpg
new file mode 100644
index 0000000000..fb2619a644
Binary files /dev/null and b/static/image/16714316310459/16714320201848.jpg 
differ
diff --git a/static/image/16714316310459/16714320394193.jpg 
b/static/image/16714316310459/16714320394193.jpg
new file mode 100644
index 0000000000..eafabc115e
Binary files /dev/null and b/static/image/16714316310459/16714320394193.jpg 
differ
diff --git a/static/image/16714316310459/16714320856052.jpg 
b/static/image/16714316310459/16714320856052.jpg
new file mode 100644
index 0000000000..31ac4a3d94
Binary files /dev/null and b/static/image/16714316310459/16714320856052.jpg 
differ
diff --git a/static/image/16714316310459/16714321290339.jpg 
b/static/image/16714316310459/16714321290339.jpg
new file mode 100644
index 0000000000..92882c2fca
Binary files /dev/null and b/static/image/16714316310459/16714321290339.jpg 
differ
diff --git a/static/image/16714316310459/16714321480992.jpg 
b/static/image/16714316310459/16714321480992.jpg
new file mode 100644
index 0000000000..ab83fb493b
Binary files /dev/null and b/static/image/16714316310459/16714321480992.jpg 
differ
diff --git a/static/image/16714322747890/16714322908857.jpg 
b/static/image/16714322747890/16714322908857.jpg
new file mode 100644
index 0000000000..3aba641ff3
Binary files /dev/null and b/static/image/16714322747890/16714322908857.jpg 
differ
diff --git a/static/image/16714322747890/16714322944041.jpg 
b/static/image/16714322747890/16714322944041.jpg
new file mode 100644
index 0000000000..44d811ee8d
Binary files /dev/null and b/static/image/16714322747890/16714322944041.jpg 
differ
diff --git a/static/image/16714322747890/16714323322988.jpg 
b/static/image/16714322747890/16714323322988.jpg
new file mode 100644
index 0000000000..24da114f18
Binary files /dev/null and b/static/image/16714322747890/16714323322988.jpg 
differ
diff --git a/static/image/16714322747890/16714323668557.jpg 
b/static/image/16714322747890/16714323668557.jpg
new file mode 100644
index 0000000000..35c531ae4c
Binary files /dev/null and b/static/image/16714322747890/16714323668557.jpg 
differ
diff --git a/static/image/16714322747890/16714323741126.jpg 
b/static/image/16714322747890/16714323741126.jpg
new file mode 100644
index 0000000000..fc697ee803
Binary files /dev/null and b/static/image/16714322747890/16714323741126.jpg 
differ
diff --git a/static/image/16714322747890/16714323846302.jpg 
b/static/image/16714322747890/16714323846302.jpg
new file mode 100644
index 0000000000..60cca70f34
Binary files /dev/null and b/static/image/16714322747890/16714323846302.jpg 
differ
diff --git a/static/image/16714322747890/16714323920717.jpg 
b/static/image/16714322747890/16714323920717.jpg
new file mode 100644
index 0000000000..63959d6735
Binary files /dev/null and b/static/image/16714322747890/16714323920717.jpg 
differ
diff --git a/static/image/16714322747890/16714324136843.jpg 
b/static/image/16714322747890/16714324136843.jpg
new file mode 100644
index 0000000000..2075d9126d
Binary files /dev/null and b/static/image/16714322747890/16714324136843.jpg 
differ
diff --git a/static/image/16714322747890/16714324237449.jpg 
b/static/image/16714322747890/16714324237449.jpg
new file mode 100644
index 0000000000..86e97525b5
Binary files /dev/null and b/static/image/16714322747890/16714324237449.jpg 
differ
diff --git a/static/image/16714322747890/16714324336701.jpg 
b/static/image/16714322747890/16714324336701.jpg
new file mode 100644
index 0000000000..159e263fc4
Binary files /dev/null and b/static/image/16714322747890/16714324336701.jpg 
differ
diff --git a/static/image/16714322747890/16714324454477.jpg 
b/static/image/16714322747890/16714324454477.jpg
new file mode 100644
index 0000000000..288b685b5e
Binary files /dev/null and b/static/image/16714322747890/16714324454477.jpg 
differ
diff --git a/static/image/16714322747890/16714324726276.jpg 
b/static/image/16714322747890/16714324726276.jpg
new file mode 100644
index 0000000000..01654ebcad
Binary files /dev/null and b/static/image/16714322747890/16714324726276.jpg 
differ
diff --git a/static/image/16714322747890/16714325444078.jpg 
b/static/image/16714322747890/16714325444078.jpg
new file mode 100644
index 0000000000..d88d718660
Binary files /dev/null and b/static/image/16714322747890/16714325444078.jpg 
differ
diff --git a/static/image/16714322747890/16714325474972.jpg 
b/static/image/16714322747890/16714325474972.jpg
new file mode 100644
index 0000000000..fc9a1e059d
Binary files /dev/null and b/static/image/16714322747890/16714325474972.jpg 
differ
diff --git a/static/image/16714322747890/16714325600316.jpg 
b/static/image/16714322747890/16714325600316.jpg
new file mode 100644
index 0000000000..411096f469
Binary files /dev/null and b/static/image/16714322747890/16714325600316.jpg 
differ
diff --git a/static/image/16714322747890/16714325681738.jpg 
b/static/image/16714322747890/16714325681738.jpg
new file mode 100644
index 0000000000..f58fc88d6e
Binary files /dev/null and b/static/image/16714322747890/16714325681738.jpg 
differ
diff --git a/static/image/16714322747890/16714325887598.jpg 
b/static/image/16714322747890/16714325887598.jpg
new file mode 100644
index 0000000000..f2417cb7b1
Binary files /dev/null and b/static/image/16714322747890/16714325887598.jpg 
differ
diff --git a/static/image/16714322747890/16714326227360.jpg 
b/static/image/16714322747890/16714326227360.jpg
new file mode 100644
index 0000000000..ea5beaada8
Binary files /dev/null and b/static/image/16714322747890/16714326227360.jpg 
differ
diff --git a/static/image/16714322747890/16714326749427.jpg 
b/static/image/16714322747890/16714326749427.jpg
new file mode 100644
index 0000000000..ed447d7baa
Binary files /dev/null and b/static/image/16714322747890/16714326749427.jpg 
differ
diff --git a/static/image/16714322747890/16714327002726.jpg 
b/static/image/16714322747890/16714327002726.jpg
new file mode 100644
index 0000000000..babfd87464
Binary files /dev/null and b/static/image/16714322747890/16714327002726.jpg 
differ
diff --git a/static/image/16714322747890/16714327218590.jpg 
b/static/image/16714322747890/16714327218590.jpg
new file mode 100644
index 0000000000..3156555483
Binary files /dev/null and b/static/image/16714322747890/16714327218590.jpg 
differ
