Re: [PR] blog [seatunnel-website]

via GitHub Wed, 19 Mar 2025 01:21:23 -0700


zhangshenghang commented on code in PR #368:
URL: https://github.com/apache/seatunnel-website/pull/368#discussion_r2002709688



##########
blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md:
##########
@@ -0,0 +1,341 @@
+# Data Pipeline Tutorial：Synchronizing from MySQL to PostgreSQL Based on 
Apache SeaTunnel
+
+
+This article provides a detailed walkthrough of how to achieve full data 
synchronization from MySQL to PostgreSQL using **Apache SeaTunnel 2.3.9**. We 
cover the complete end-to-end process — from environment setup to production 
validation. Let’s dive into the MySQL-to-PostgreSQL synchronization scenario.
+
+Version Requirements：
+
+-   **MySQL：** MySQL 8.3
+-   **PostgreSQL：** PostgreSQL 13.2
+-   **Apache SeaTunnel：** Apache-SeaTunnel-2.3.9
+
+# Preliminaries
+
+# Verify Version Information
+
+Run the following SQL command to check the version：
+```
+-- Check version information 
+select version();
+```
+
+# Enable Master-Slave Replication
+
+```
+-- View replication-related variables
+show variables where variable_name in ('log_bin', 'binlog_format', 
'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+```
+
+
+For MySQL CDC data synchronization, SeaTunnel needs to read the 
MySQL`binlog`and act as a slave node in the MySQL cluster.  
+
+_Note： In MySQL 8.0+,`binlog`is enabled by default, but replication mode must 
be enabled manually._
+
+```
+-- Enable master-slave replication (execute in sequence)
+-- SET GLOBAL gtid_mode=OFF;
+-- SET GLOBAL enforce_gtid_consistency=OFF;
+SET GLOBAL gtid_mode=OFF_PERMISSIVE;
+SET GLOBAL gtid_mode=ON_PERMISSIVE;
+SET GLOBAL enforce_gtid_consistency=ON;
+SET GLOBAL gtid_mode=ON;
+```
+
+
+# Grant Necessary User Permissions
+
+A user must have`REPLICATION SLAVE`and`REPLICATION CLIENT`privileges：
+
+```
+-- Grant privileges to the user
+CREATE USER 'test'@'%' IDENTIFIED BY 'password';
+GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON 
*.* TO 'test';
+FLUSH PRIVILEGES;
+```
+
+# SeaTunnel Cluster Setup
+
+## Cluster Logging
+
+By default, SeaTunnel logs output to a single file. For production, it’s 
preferable to have separate log files per job. Update the logging configuration 
in`log4j2.properties`：
+
+```
+############################ log output to file #############################
+# rootLogger.appenderRef.file.ref = fileAppender
+# Change log output to use independent log files for each job
+rootLogger.appenderRef.file.ref = routingAppender
+############################ log output to file #############################
+```
+
+## Client Configuration
+
+For production clusters, it is recommended to install SeaTunnel under 
the`/opt`directory and point the`SEATUNNEL_HOME`environment variable 
accordingly.
+
+If multiple versions exist, create a symbolic link to align with the server 
deployment directory：
+
+```
+# Create a symlink
+ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
+# Set environment variable
+export SEATUNNEL_HOME=/opt/seatunnel
+```
+
+## Environment Variables Configuration
+
+For Linux servers, add the following lines to`/etc/profile.d/seatunnel.sh`：
+
+```
+echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
+echo 'export PATH=$SEATUNNEL_HOME/bin：$PATH' >> /etc/profile.d/seatunnel.sh
+source /etc/profile.d/seatunnel.sh
+```
+
+# Job Configuration
+
+Note： The configuration below does not cover all options but illustrates 
common production settings.
+
+```
+env {
+  job.mode = "STREAMING"
+  job.name = "DEMO"
+  parallelism = 3
+  checkpoint.interval = 30000  # 30 seconds
+  checkpoint.timeout = 30000   # 30 seconds
+  
+  job.retry.times = 3 
+  job.retry.interval.seconds = 3  # 3 seconds
+}
+```
+The first step is setting up the`env`module, which operates in a streaming 
mode. Therefore, it’s essential to specify the configuration mode as`STREAMING`.
+
+## Task Naming and Management
+
+Configuring a task name is crucial for identifying and managing jobs in a 
production environment. Naming conventions based on database or table names can 
help with monitoring and administration.
+
+## Parallelism Settings
+
+Here, we set the parallelism to**3**, but this value can be adjusted based on 
the cluster size and database performance.
+
+## Checkpoint Configuration
+
+-   **Checkpoint Frequency**： Set to**30 seconds**. If higher precision is 
required, this can be reduced to**10 seconds**or less.
+-   **Checkpoint Timeout**： If a checkpoint takes too long, the job is 
considered failed. Set to**30 seconds**.
+-   **Automatic Retry**： Configured to**3 retries**, with a retry interval 
of**3 seconds**(adjustable based on system requirements).
+
+
+```
+source {
+  MySQL-CDC {
+    base-url = 
"jdbc：mysql：//192.168.8.101：3306/test?serverTimezone=Asia/Shanghai"
+    username = "test"
+    password = "123456"
+    
+    database-names = ["test"]
+    # table-names = ["test.test_001","test.test_002"]
+    table-pattern = "test\\.test_.*"  # The first dot is a literal character, 
requiring escaping; the second dot represents any single character.
+    table-names-config = [
+        {"table"："test.test_002","primaryKeys"：["id"]}
+        ]
+    
+    startup.mode = "initial" # First sync all historical data, then 
incremental updates
+    snapshot.split.size = "8096" 
+    snapshot.fetch.size = "1024"
+    server-id = "6500-8500"
+    connect.timeout.ms = 30000
+    connect.max-retries = 3
+    connection.pool.size = 20
+    
+    exactly_once = false   # In analytical scenarios, disabling exactly-once 
consistency allows some duplicates/losses for better performance.
+    schema-changes.enabled = true # Enable schema evolution to avoid frequent 
modifications; supports add, rename, drop operations.
+  }
+}
+```
+
+
+
+# Key MySQL CDC Configurations
+
+1.  **Time Zone Configuration**： It’s recommended to specify the MySQL 
connection timezone to prevent discrepancies when 
extracting`datetime`or`timestamp`data.
+2.  **User Credentials**：
+
+-   The **username** and **password** must have **replication privileges** , 
allowing access to the **bin_log** logs.
+-   The account should be able to query all tables under the designated 
databases.
+
+## Database & Table Selection
+
+Typically, each database is assigned to a separate task. Here, we specify only 
the`test`database.  
+Two methods can be used：
+
+1.  **Direct table name selection**
+2.  **Regular expression-based table matching**(recommended for large datasets 
or entire database synchronization).
+
+> **Important：**  
+> When using **regular expressions**, both the **database name** and **table 
name** must be included. The`.`character, which separates them, must be escaped 
(`\\.`).
+
+For example, to match tables prefixed with`test_`, we use：
+```
+test\\.test_.*
+```
+
+-   The first dot (`.`) represents a literal separator, requiring escaping 
(`\\.`).
+-   The second dot (`.`) represents**any single character**in regex.
+
+Additionally, for tables **without primary keys**, logical primary keys can be 
specified manually to facilitate data synchronization.
+
+# Startup Mode
+
+The default startup mode is**“initial”**, which means：

Review Comment:
   There is a problem with the display here. <img width="90" alt="image" 
src="https://github.com/user-attachments/assets/c9172842-faa8-4168-baca-b96a73263dea";
 />
   
   ```suggestion
   The default startup mode is**initial**, which means：
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] blog [seatunnel-website]

Reply via email to