Re: [PR] blog [seatunnel-website]

via GitHub Wed, 19 Mar 2025 01:22:36 -0700


zhangshenghang commented on code in PR #368:
URL: https://github.com/apache/seatunnel-website/pull/368#discussion_r2002717376



##########
blog/2025-03-19-Data_Pipeline_Tutorial_Synchronizing_from_MySQL_to_PostgreSQL_Based_on_Apache_SeaTunnel.md:
##########
@@ -0,0 +1,341 @@
+# Data Pipeline Tutorial：Synchronizing from MySQL to PostgreSQL Based on 
Apache SeaTunnel
+
+
+This article provides a detailed walkthrough of how to achieve full data 
synchronization from MySQL to PostgreSQL using **Apache SeaTunnel 2.3.9**. We 
cover the complete end-to-end process — from environment setup to production 
validation. Let’s dive into the MySQL-to-PostgreSQL synchronization scenario.
+
+Version Requirements：
+
+-   **MySQL：** MySQL 8.3
+-   **PostgreSQL：** PostgreSQL 13.2
+-   **Apache SeaTunnel：** Apache-SeaTunnel-2.3.9
+
+# Preliminaries
+
+# Verify Version Information
+
+Run the following SQL command to check the version：
+```
+-- Check version information 
+select version();
+```
+
+# Enable Master-Slave Replication
+
+```
+-- View replication-related variables
+show variables where variable_name in ('log_bin', 'binlog_format', 
'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
+```
+
+
+For MySQL CDC data synchronization, SeaTunnel needs to read the 
MySQL`binlog`and act as a slave node in the MySQL cluster.  
+
+_Note： In MySQL 8.0+,`binlog`is enabled by default, but replication mode must 
be enabled manually._
+
+```
+-- Enable master-slave replication (execute in sequence)
+-- SET GLOBAL gtid_mode=OFF;
+-- SET GLOBAL enforce_gtid_consistency=OFF;
+SET GLOBAL gtid_mode=OFF_PERMISSIVE;
+SET GLOBAL gtid_mode=ON_PERMISSIVE;
+SET GLOBAL enforce_gtid_consistency=ON;
+SET GLOBAL gtid_mode=ON;
+```
+
+
+# Grant Necessary User Permissions
+
+A user must have`REPLICATION SLAVE`and`REPLICATION CLIENT`privileges：
+
+```
+-- Grant privileges to the user
+CREATE USER 'test'@'%' IDENTIFIED BY 'password';
+GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON 
*.* TO 'test';
+FLUSH PRIVILEGES;
+```
+
+# SeaTunnel Cluster Setup
+
+## Cluster Logging
+
+By default, SeaTunnel logs output to a single file. For production, it’s 
preferable to have separate log files per job. Update the logging configuration 
in`log4j2.properties`：
+
+```
+############################ log output to file #############################
+# rootLogger.appenderRef.file.ref = fileAppender
+# Change log output to use independent log files for each job
+rootLogger.appenderRef.file.ref = routingAppender
+############################ log output to file #############################
+```
+
+## Client Configuration
+
+For production clusters, it is recommended to install SeaTunnel under 
the`/opt`directory and point the`SEATUNNEL_HOME`environment variable 
accordingly.
+
+If multiple versions exist, create a symbolic link to align with the server 
deployment directory：
+
+```
+# Create a symlink
+ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
+# Set environment variable
+export SEATUNNEL_HOME=/opt/seatunnel
+```
+
+## Environment Variables Configuration
+
+For Linux servers, add the following lines to`/etc/profile.d/seatunnel.sh`：
+
+```
+echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
+echo 'export PATH=$SEATUNNEL_HOME/bin：$PATH' >> /etc/profile.d/seatunnel.sh
+source /etc/profile.d/seatunnel.sh
+```
+
+# Job Configuration
+
+Note： The configuration below does not cover all options but illustrates 
common production settings.
+
+```
+env {
+  job.mode = "STREAMING"
+  job.name = "DEMO"
+  parallelism = 3
+  checkpoint.interval = 30000  # 30 seconds
+  checkpoint.timeout = 30000   # 30 seconds
+  
+  job.retry.times = 3 
+  job.retry.interval.seconds = 3  # 3 seconds
+}
+```
+The first step is setting up the`env`module, which operates in a streaming 
mode. Therefore, it’s essential to specify the configuration mode as`STREAMING`.
+
+## Task Naming and Management
+
+Configuring a task name is crucial for identifying and managing jobs in a 
production environment. Naming conventions based on database or table names can 
help with monitoring and administration.
+
+## Parallelism Settings
+
+Here, we set the parallelism to**3**, but this value can be adjusted based on 
the cluster size and database performance.
+
+## Checkpoint Configuration
+
+-   **Checkpoint Frequency**： Set to**30 seconds**. If higher precision is 
required, this can be reduced to**10 seconds**or less.

Review Comment:
   ```suggestion
   -   **Checkpoint Frequency**： Set to **30 seconds**. If higher precision is 
required, this can be reduced to **10 seconds**or less.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] blog [seatunnel-website]

Reply via email to