zhangshenghang commented on code in PR #368:
URL: https://github.com/apache/seatunnel-website/pull/368#discussion_r2002709688
##########
blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md:
##########
@@ -0,0 +1,341 @@
+# Data Pipeline Tutorial:Synchronizing from MySQL to PostgreSQL Based on
Apache SeaTunnel
+
+
+This article provides a detailed walkthrough of how to achieve full data
synchronization from MySQL to PostgreSQL using **Apache SeaTunnel 2.3.9**. We
cover the complete end-to-end process — from environment setup to production
validation. Let’s dive into the MySQL-to-PostgreSQL synchronization scenario.
+
+Version Requirements:
+
+- **MySQL:** MySQL 8.3
+- **PostgreSQL:** PostgreSQL 13.2
+- **Apache SeaTunnel:** Apache-SeaTunnel-2.3.9
+
+# Preliminaries
+
+# Verify Version Information
+
+Run the following SQL command to check the version:
+```
+-- Check version information
+select version();
+```
+
+# Enable Master-Slave Replication
+
+```
+-- View replication-related variables
+show variables where variable_name in ('log_bin', 'binlog_format',
'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+```
+
+
+For MySQL CDC data synchronization, SeaTunnel needs to read the
MySQL`binlog`and act as a slave node in the MySQL cluster.
+
+_Note: In MySQL 8.0+,`binlog`is enabled by default, but replication mode must
be enabled manually._
+
+```
+-- Enable master-slave replication (execute in sequence)
+-- SET GLOBAL gtid_mode=OFF;
+-- SET GLOBAL enforce_gtid_consistency=OFF;
+SET GLOBAL gtid_mode=OFF_PERMISSIVE;
+SET GLOBAL gtid_mode=ON_PERMISSIVE;
+SET GLOBAL enforce_gtid_consistency=ON;
+SET GLOBAL gtid_mode=ON;
+```
+
+
+# Grant Necessary User Permissions
+
+A user must have`REPLICATION SLAVE`and`REPLICATION CLIENT`privileges:
+
+```
+-- Grant privileges to the user
+CREATE USER 'test'@'%' IDENTIFIED BY 'password';
+GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON
*.* TO 'test';
+FLUSH PRIVILEGES;
+```
+
+# SeaTunnel Cluster Setup
+
+## Cluster Logging
+
+By default, SeaTunnel logs output to a single file. For production, it’s
preferable to have separate log files per job. Update the logging configuration
in`log4j2.properties`:
+
+```
+############################ log output to file #############################
+# rootLogger.appenderRef.file.ref = fileAppender
+# Change log output to use independent log files for each job
+rootLogger.appenderRef.file.ref = routingAppender
+############################ log output to file #############################
+```
+
+## Client Configuration
+
+For production clusters, it is recommended to install SeaTunnel under
the`/opt`directory and point the`SEATUNNEL_HOME`environment variable
accordingly.
+
+If multiple versions exist, create a symbolic link to align with the server
deployment directory:
+
+```
+# Create a symlink
+ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
+# Set environment variable
+export SEATUNNEL_HOME=/opt/seatunnel
+```
+
+## Environment Variables Configuration
+
+For Linux servers, add the following lines to`/etc/profile.d/seatunnel.sh`:
+
+```
+echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
+echo 'export PATH=$SEATUNNEL_HOME/bin:$PATH' >> /etc/profile.d/seatunnel.sh
+source /etc/profile.d/seatunnel.sh
+```
+
+# Job Configuration
+
+Note: The configuration below does not cover all options but illustrates
common production settings.
+
+```
+env {
+ job.mode = "STREAMING"
+ job.name = "DEMO"
+ parallelism = 3
+ checkpoint.interval = 30000 # 30 seconds
+ checkpoint.timeout = 30000 # 30 seconds
+
+ job.retry.times = 3
+ job.retry.interval.seconds = 3 # 3 seconds
+}
+```
+The first step is setting up the`env`module, which operates in a streaming
mode. Therefore, it’s essential to specify the configuration mode as`STREAMING`.
+
+## Task Naming and Management
+
+Configuring a task name is crucial for identifying and managing jobs in a
production environment. Naming conventions based on database or table names can
help with monitoring and administration.
+
+## Parallelism Settings
+
+Here, we set the parallelism to**3**, but this value can be adjusted based on
the cluster size and database performance.
+
+## Checkpoint Configuration
+
+- **Checkpoint Frequency**: Set to**30 seconds**. If higher precision is
required, this can be reduced to**10 seconds**or less.
+- **Checkpoint Timeout**: If a checkpoint takes too long, the job is
considered failed. Set to**30 seconds**.
+- **Automatic Retry**: Configured to**3 retries**, with a retry interval
of**3 seconds**(adjustable based on system requirements).
+
+
+```
+source {
+ MySQL-CDC {
+ base-url =
"jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"
+ username = "test"
+ password = "123456"
+
+ database-names = ["test"]
+ # table-names = ["test.test_001","test.test_002"]
+ table-pattern = "test\\.test_.*" # The first dot is a literal character,
requiring escaping; the second dot represents any single character.
+ table-names-config = [
+ {"table":"test.test_002","primaryKeys":["id"]}
+ ]
+
+ startup.mode = "initial" # First sync all historical data, then
incremental updates
+ snapshot.split.size = "8096"
+ snapshot.fetch.size = "1024"
+ server-id = "6500-8500"
+ connect.timeout.ms = 30000
+ connect.max-retries = 3
+ connection.pool.size = 20
+
+ exactly_once = false # In analytical scenarios, disabling exactly-once
consistency allows some duplicates/losses for better performance.
+ schema-changes.enabled = true # Enable schema evolution to avoid frequent
modifications; supports add, rename, drop operations.
+ }
+}
+```
+
+
+
+# Key MySQL CDC Configurations
+
+1. **Time Zone Configuration**: It’s recommended to specify the MySQL
connection timezone to prevent discrepancies when
extracting`datetime`or`timestamp`data.
+2. **User Credentials**:
+
+- The **username** and **password** must have **replication privileges** ,
allowing access to the **bin_log** logs.
+- The account should be able to query all tables under the designated
databases.
+
+## Database & Table Selection
+
+Typically, each database is assigned to a separate task. Here, we specify only
the`test`database.
+Two methods can be used:
+
+1. **Direct table name selection**
+2. **Regular expression-based table matching**(recommended for large datasets
or entire database synchronization).
+
+> **Important:**
+> When using **regular expressions**, both the **database name** and **table
name** must be included. The`.`character, which separates them, must be escaped
(`\\.`).
+
+For example, to match tables prefixed with`test_`, we use:
+```
+test\\.test_.*
+```
+
+- The first dot (`.`) represents a literal separator, requiring escaping
(`\\.`).
+- The second dot (`.`) represents**any single character**in regex.
+
+Additionally, for tables **without primary keys**, logical primary keys can be
specified manually to facilitate data synchronization.
+
+# Startup Mode
+
+The default startup mode is**“initial”**, which means:
Review Comment:
There is a problem with the display here. <img width="90" alt="image"
src="https://github.com/user-attachments/assets/c9172842-faa8-4168-baca-b96a73263dea"
/>
```suggestion
The default startup mode is**initial**, which means:
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]