(seatunnel-website) branch main updated: blog (#368)

lidongdai Tue, 30 Dec 2025 03:58:51 -0800

This is an automated email from the ASF dual-hosted git repository.

lidongdai pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/seatunnel-website.git



The following commit(s) were added to refs/heads/main by this push:
     new 5451f97c3ec1 blog (#368)
5451f97c3ec1 is described below

commit 5451f97c3ec1b8933e1260c6f6c0e777b4dbeeb3
Author: Niko-Zeng <[email protected]>
AuthorDate: Tue Dec 30 19:58:40 2025 +0800

    blog (#368)
    
    Co-authored-by: David Zollo <[email protected]>
    Co-authored-by: Jast <[email protected]>
---
 ...ySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md | 341 +++++++++++++++++++++
 1 file changed, 341 insertions(+)

diff --git 
a/blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md
 
b/blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md
new file mode 100644
index 000000000000..cb2e4031613d
--- /dev/null
+++ 
b/blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md
@@ -0,0 +1,341 @@
+# Data Pipeline Tutorial：Synchronizing from MySQL to PostgreSQL Based on 
Apache SeaTunnel
+
+
+This article provides a detailed walkthrough of how to achieve full data 
synchronization from MySQL to PostgreSQL using **Apache SeaTunnel 2.3.9**. We 
cover the complete end-to-end process — from environment setup to production 
validation. Let’s dive into the MySQL-to-PostgreSQL synchronization scenario.
+
+Version Requirements：
+
+-   **MySQL：** MySQL 8.3
+-   **PostgreSQL：** PostgreSQL 13.2
+-   **Apache SeaTunnel：** Apache-SeaTunnel-2.3.9
+
+# Preliminaries
+
+# Verify Version Information
+
+Run the following SQL command to check the version：
+```
+-- Check version information 
+select version();
+```
+
+# Enable Master-Slave Replication
+
+```
+-- View replication-related variables
+show variables where variable_name in ('log_bin', 'binlog_format', 
'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+```
+
+
+For MySQL CDC data synchronization, SeaTunnel needs to read the 
MySQL`binlog`and act as a slave node in the MySQL cluster.  
+
+_Note： In MySQL 8.0+,`binlog`is enabled by default, but replication mode must 
be enabled manually._
+
+```
+-- Enable master-slave replication (execute in sequence)
+-- SET GLOBAL gtid_mode=OFF;
+-- SET GLOBAL enforce_gtid_consistency=OFF;
+SET GLOBAL gtid_mode=OFF_PERMISSIVE;
+SET GLOBAL gtid_mode=ON_PERMISSIVE;
+SET GLOBAL enforce_gtid_consistency=ON;
+SET GLOBAL gtid_mode=ON;
+```
+
+
+# Grant Necessary User Permissions
+
+A user must have`REPLICATION SLAVE`and`REPLICATION CLIENT`privileges：
+
+```
+-- Grant privileges to the user
+CREATE USER 'test'@'%' IDENTIFIED BY 'password';
+GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON 
*.* TO 'test';
+FLUSH PRIVILEGES;
+```
+
+# SeaTunnel Cluster Setup
+
+## Cluster Logging
+
+By default, SeaTunnel logs output to a single file. For production, it’s 
preferable to have separate log files per job. Update the logging configuration 
in`log4j2.properties`：
+
+```
+############################ log output to file #############################
+# rootLogger.appenderRef.file.ref = fileAppender
+# Change log output to use independent log files for each job
+rootLogger.appenderRef.file.ref = routingAppender
+############################ log output to file #############################
+```
+
+## Client Configuration
+
+For production clusters, it is recommended to install SeaTunnel under 
the`/opt`directory and point the`SEATUNNEL_HOME`environment variable 
accordingly.
+
+If multiple versions exist, create a symbolic link to align with the server 
deployment directory：
+
+```
+# Create a symlink
+ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
+# Set environment variable
+export SEATUNNEL_HOME=/opt/seatunnel
+```
+
+## Environment Variables Configuration
+
+For Linux servers, add the following lines to`/etc/profile.d/seatunnel.sh`：
+
+```
+echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
+echo 'export PATH=$SEATUNNEL_HOME/bin：$PATH' >> /etc/profile.d/seatunnel.sh
+source /etc/profile.d/seatunnel.sh
+```
+
+# Job Configuration
+
+Note： The configuration below does not cover all options but illustrates 
common production settings.
+
+```
+env {
+  job.mode = "STREAMING"
+  job.name = "DEMO"
+  parallelism = 3
+  checkpoint.interval = 30000  # 30 seconds
+  checkpoint.timeout = 30000   # 30 seconds
+  
+  job.retry.times = 3 
+  job.retry.interval.seconds = 3  # 3 seconds
+}
+```
+The first step is setting up the`env`module, which operates in a streaming 
mode. Therefore, it’s essential to specify the configuration mode as`STREAMING`.
+
+## Task Naming and Management
+
+Configuring a task name is crucial for identifying and managing jobs in a 
production environment. Naming conventions based on database or table names can 
help with monitoring and administration.
+
+## Parallelism Settings
+
+Here, we set the parallelism to**3**, but this value can be adjusted based on 
the cluster size and database performance.
+
+## Checkpoint Configuration
+
+-   **Checkpoint Frequency**： Set to **30 seconds**. If higher precision is 
required, this can be reduced to **10 seconds**or less.
+-   **Checkpoint Timeout**： If a checkpoint takes too long, the job is 
considered failed. Set to**30 seconds**.
+-   **Automatic Retry**： Configured to**3 retries**, with a retry interval 
of**3 seconds**(adjustable based on system requirements).
+
+
+```
+source {
+  MySQL-CDC {
+    base-url = 
"jdbc：mysql：//192.168.8.101：3306/test?serverTimezone=Asia/Shanghai"
+    username = "test"
+    password = "123456"
+    
+    database-names = ["test"]
+    # table-names = ["test.test_001","test.test_002"]
+    table-pattern = "test\\.test_.*"  # The first dot is a literal character, 
requiring escaping; the second dot represents any single character.
+    table-names-config = [
+        {"table"："test.test_002","primaryKeys"：["id"]}
+        ]
+    
+    startup.mode = "initial" # First sync all historical data, then 
incremental updates
+    snapshot.split.size = "8096" 
+    snapshot.fetch.size = "1024"
+    server-id = "6500-8500"
+    connect.timeout.ms = 30000
+    connect.max-retries = 3
+    connection.pool.size = 20
+    
+    exactly_once = false   # In analytical scenarios, disabling exactly-once 
consistency allows some duplicates/losses for better performance.
+    schema-changes.enabled = true # Enable schema evolution to avoid frequent 
modifications; supports add, rename, drop operations.
+  }
+}
+```
+
+
+
+# Key MySQL CDC Configurations
+
+1.  **Time Zone Configuration**： It’s recommended to specify the MySQL 
connection timezone to prevent discrepancies when 
extracting`datetime`or`timestamp`data.
+2.  **User Credentials**：
+
+-   The **username** and **password** must have **replication privileges** , 
allowing access to the **bin_log** logs.
+-   The account should be able to query all tables under the designated 
databases.
+
+## Database & Table Selection
+
+Typically, each database is assigned to a separate task. Here, we specify only 
the`test`database.  
+Two methods can be used：
+
+1.  **Direct table name selection**
+2.  **Regular expression-based table matching**(recommended for large datasets 
or entire database synchronization).
+
+> **Important：**  
+> When using **regular expressions**, both the **database name** and **table 
name** must be included. The`.`character, which separates them, must be escaped 
(`\\.`).
+
+For example, to match tables prefixed with`test_`, we use：
+```
+test\\.test_.*
+```
+
+-   The first dot (`.`) represents a literal separator, requiring escaping 
(`\\.`).
+-   The second dot (`.`) represents**any single character**in regex.
+
+Additionally, for tables **without primary keys**, logical primary keys can be 
specified manually to facilitate data synchronization.
+
+# Startup Mode
+
+The default startup mode is**initial**, which means：
+
+1.  **Full historical data sync**first
+2.  **Incremental updates**afterward
+
+**Sharding & Fetching**
+
+-   The **default values** for **shard size** and **batch fetch size** work 
well.
+-   If the server has **higher performance**, these values can be increased.
+
+**Server ID**
+
+-   MySQL requires **unique server IDs** for replication nodes.
+-   Apache SeaTunnel must **masquerade as a MySQL replica**.
+-   If not configured, a default value is used, but **manual specification is 
recommended** to avoid conflicts.
+-   The **server ID range must be greater than the parallelism level**, or 
errors may occur.
+
+# Timeouts & Retries
+
+-   **Connection Timeout**： For large datasets, increase this value 
accordingly.
+-   **Auto-Retry Interval**： If handling a high volume of tables, consider 
extending retry intervals.
+
+# Exactly-Once Consistency
+
+For **CDC-based data synchronization**,**exactly-once consistency** is often 
**not required** in analytical scenarios.
+
+-   **Disabling**it can significantly **boost performance**.
+-   However, if strict consistency is required, it **can be enabled** at the 
cost of reduced performance.
+
+# Schema Evolution
+
+It’s **highly recommended** to **enable schema evolution**, which：
+
+-   Allows **automatic table modifications** (e.g., adding/removing fields)
+-   Reduces the need for **manual job updates** when the schema changes
+
+However,**downstream tasks** may fail if they rely on a field that was 
modified.  
+Supported schema changes：  
+✔️`ADD COLUMN`  
+✔️`DROP COLUMN`  
+✔️`RENAME COLUMN`  
+✔️`MODIFY COLUMN`
+
+> **Note：** Schema evolution **does not support**`CREATE TABLE`or`DROP TABLE`.
+
+# Configuring the Sink (PostgreSQL)
+
+The **sink** configuration inserts data into **PostgreSQL**.
+```
+sink {
+  jdbc {
+        url = "jdbc：postgresql：//192.168.8.101：5432/test"
+        driver = "org.postgresql.Driver"
+        user = "postgres"
+        password = "123456"
+        
+        generate_sink_sql = true
+        database = "test"
+        table = "${database_name}.${table_name}"
+        schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
+        data_save_mode = "APPEND_DATA"
+        # enable_upsert = false
+    }
+}
+```
+
+**Key Considerations：**
+
+**JDBC Connection**：
+
+-   Specify PostgreSQL **driver, user, and password**.
+
+**Auto SQL Generation**：
+
+-   Enabling `generate_sink_sql` lets SeaTunnel automatically create tables 
and generate `INSERT`,`DELETE`, and`UPDATE`statements.
+
+**Schema Handling**：
+
+-   PostgreSQL uses **Database → Schema → Table**, while MySQL has only 
**Database → Table**.
+-   Ensure the **schema is correctly mapped** to avoid data mismatches.
+
+**User Permissions**：
+
+-   The PostgreSQL **user must have table creation permissions** if using 
auto-schema generation.
+
+> **_For more details, refer to the official documentation：_**  
+> _🔗_[_SeaTunnel MySQL-CDC Connector 
Docs_](https：//seatunnel.apache.org/docs/2.3.9/connector-v2/source/MySQL-CDC/)
+
+# Using Placeholders in Sink Configuration
+
+Apache SeaTunnel supports **placeholders**, which dynamically adjust table 
names based on the source data.
+
+For example：
+```
+table = "${database\_name}.${table\_name}"
+```
+-   Ensures each table syncs correctly without **manual specification**.
+-   Supports **concatenation and dynamic formatting**.
+
+# Schema Save Mode and Data Append Strategy
+
+The`schema_save_mode`parameter plays a crucial role in **database-wide 
synchronization**. It simplifies the process by **automatically creating 
tables** in the target database, eliminating the need for manual table creation 
steps.
+
+Another key configuration is`APPEND_DATA`, which is particularly useful when 
the target database already contains **previously synchronized data**. This 
setting **prevents the accidental deletion of existing records** , making it a 
**safer** choice for most scenarios. However, if your use case requires a 
different approach, you can modify this setting according to the **official 
documentation guidelines**.
+
+# Enable Upsert for Performance Optimization
+
+Another important parameter is`enable_upsert`. If you **can guarantee that the 
source data contains no duplicate records** , disabling upsert (`enable_upsert 
= false`) can **significantly enhance synchronization performance** . This is 
because, without upsert, the system does not need to check for existing records 
before inserting new ones.
+
+However, if there is a possibility of duplicate records in the source data, it 
is **strongly recommended** to keep **Upsert enabled (`enable_upsert = true`). 
This ensures that records are inserted or updated based on their **primary 
key** , preventing duplication issues.
+
+For **detailed parameter explanations and further customization options** , 
please refer to the **official Apache SeaTunnel documentation**.
+
+# Task Submission and Monitoring
+
+Once your configuration file is ready, submit the job using the SeaTunnel 
command-line tool：
+
+```
+./bin/start-seatunnel.sh --config /path/to/config.yaml --async
+```
+
+**Key Parameters：**
+
+-   `--config`： Specifies the path to your configuration file.
+-   `--async`： Submits the job asynchronously, allowing the command line to 
exit while the job continues in the background.
+
+After submission, you can monitor the job via SeaTunnel’s cluster UI. In 
version 2.3.9, SeaTunnel provides a visual interface where you can view job 
logs, execution status, and data throughput details.
+
+# Data Synchronization Demonstration
+
+For this demonstration, we created two tables (`test_001`and`test_002`) and 
inserted sample data into MySQL. Using SeaTunnel's synchronization tasks, the 
data was successfully synchronized to PostgreSQL. The demonstration included 
insertions, deletions, updates, and even table schema modifications—all of 
which were reflected in real time on PostgreSQL.
+
+**Key Points：**
+
+-   **Schema Synchronization：**  
+    SeaTunnel supports automatic table schema synchronization. When the source 
MySQL table structure changes, the target PostgreSQL table automatically 
updates.
+-   **Data Consistency：**  
+    SeaTunnel ensures data consistency by accurately synchronizing all insert, 
delete, and update operations to the target database.
+
+# About SeaTunnel
+
+Apache SeaTunnel focuses on data integration and synchronization, addressing 
common challenges such as：
+
+-   **Diverse Data Sources：**  
+    Supporting hundreds of data sources, even as new ones emerge.
+-   **Complex Sync Scenarios：**  
+    Including full, incremental, CDC, real-time, and whole-database 
synchronizations.
+-   **High Resource Demands：**  
+    Traditional tools often require extensive computing or JDBC resources for 
real-time sync of many small tables.
+-   **Monitoring and Quality：**  
+    Sync processes can suffer from data loss or duplication, and effective 
monitoring is essential.
+-   **Complex Technology Stacks：**  
+    Multiple sync programs may be needed for different systems.
+-   **Management Challenges：**  
+    Offline and real-time sync are often developed and managed separately, 
increasing complexity.
\ No newline at end of file

(seatunnel-website) branch main updated: blog (#368)

Reply via email to