Hisoka-X commented on code in PR #8011: URL: https://github.com/apache/seatunnel/pull/8011#discussion_r1836605866
########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. 
If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? +You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. 
- -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: +1. Check if the SQL Server CDC Agent is running: ``` -var = """ -your string 1 -"""${you_var}""" your string 2""" +EXEC xp_servicecontrol N'querystate', N'SQLServerAGENT'; +-- If the result is "running," it means the agent is enabled. Otherwise, it needs to be started manually. ``` -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). - -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? - -Of course! See the screenshot below: - - - - - -## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time? - +2. If using Linux, enable the SQL Server CDC Agent: ``` -env { - ... -} - -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} - -transform { - ... -} - -sink { - elasticsearch { ... } -} +/opt/mssql/bin/mssql-conf setup +The result that is returned is as follows: +1) Evaluation (free, no production use rights, 180-day limit) +2) Developer (free, no production use rights) +3) Express (free) +4) Web (PAID) +5) Standard (PAID) +6) Enterprise (PAID) +7) Enterprise Core (PAID) +8) I bought a license through a retail sales channel and have a product key to enter. ``` - -## Are there any HBase plugins? 
- -There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase . - -## How can I use SeaTunnel to write data to Hive? - +Choose the appropriate option based on your situation. +Select option 2 (Developer) for a free version that includes the agent. Enable the agent by running: ``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. - stdout { - limit = 1 - } -} +/opt/mssql/bin/mssql-conf set sqlagent.enabled true ``` -In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910. - -## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing? - -1. Write distributed tables directly (not recommended) - -2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse: - - ``` - { - output { - clickhouse { - host = "ck-proxy.xx.xx:8123" - # Local table - table = "table_name" - } - } - } - ``` -3. Configure multiple instances in the configuration: - - ``` - { - output { - clickhouse { - host = "ck1:8123,ck2:8123,ck3:8123" - # Local table - table = "table_name" - } - } - } - ``` -4. Use cluster mode: - - ``` - { - output { - clickhouse { - # Configure only one host - host = "ck1:8123" - cluster = "clickhouse_cluster_name" - # Local table - table = "table_name" - } - } - } - ``` - -## How can I solve OOM when SeaTunnel consumes Kafka? - -In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows: - -For the current limit of Spark consumption of Kafka: - -1. 
Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N. - -2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform. - -3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M. - -The following conclusions can be drawn: - -1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M +If using Windows, enable SQL Server Agent (e.g., for SQL Server 2008): + - Refer to the [official documentation](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms191454(v=sql.105)). +``` +Open "SQL Server Configuration Manager" from the Start menu, navigate to "SQL Server Services," right-click the "SQL Server Agent" instance, and start it. +``` -2. When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N +3. Firstly, enable CDC at the database level: +``` +USE TestDB; -- Replace with your actual database name +EXEC sys.sp_cdc_enable_db; -3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption. +-- Check if the database has CDC enabled +SELECT name, is_cdc_enabled +FROM sys.databases +WHERE name = 'database'; -- Replace with the name of your database +``` - +4. 
Secondly, enable CDC at the table level: +``` +USE TestDB; -- Replace with your actual database name +EXEC sys.sp_cdc_enable_table +@source_schema = 'dbo', +@source_name = 'table', -- Replace with the table name +@role_name = NULL, +@capture_instance = 'table'; -- Replace with a unique capture instance name -## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`? +-- Check if the table has CDC enabled +SELECT name, is_tracked_by_cdc +FROM sys.tables +WHERE name = 'table'; -- Replace with the table name +``` -The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version. +## Does SeaTunnel support CDC synchronization for tables without primary keys? +No, CDC synchronization is not supported for tables without primary keys. This is because, if there are two identical rows upstream and one is deleted or modified, it would be impossible to distinguish which row should be deleted or modified downstream, potentially resulting in both rows being affected. -## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8? +## Error during PostgreSQL task execution: Caused by: org.postgresql.util.PSQLException: ERROR: all replication slots are in use +This error occurs when the replication slots in PostgreSQL are full and need to be released. 
Modify the `postgresql.conf` file to increase `max_wal_senders` and `max_replication_slots`, then restart the PostgreSQL service using the command: +``` +systemctl restart postgresql +``` +Example configuration: +``` +max_wal_senders = 1000 # max number of walsender processes +max_replication_slots = 1000 # max number of replication slots +``` -In SeaTunnel's config file, specify the following configuration: +## What should I do if I have a problem that I can't solve on my own? +If you encounter an issue while using SeaTunnel that you cannot resolve, you can: +1. Search the [issue list](https://github.com/apache/seatunnel/issues) or [mailing list](https://lists.apache.org/[email protected]) to see if someone else has asked the same question and received an answer. +2. If you can't find an answer, reach out to the community for help using [these methods](https://github.com/apache/seatunnel#contact-us). -```shell -spark { - ... - spark.executorEnv.JAVA_HOME="/your/java_8_home/directory" - spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory" - ... +## How do I declare variables? +Do you want to know how to declare a variable in a SeaTunnel configuration and dynamically replace its value at runtime? This feature is often used in both scheduled and non-scheduled offline processing as a placeholder for variables such as time and date. Here’s how to do it: +Declare a variable name in the configuration. Below is an example of a SQL transformation (in fact, any value in `key = value` format can use variable substitution): +``` +... +transform { + Sql { + query = "select * from user_view where city ='${city}' and dt = '${date}'" + } } +... ``` - -## What should I do if OOM always appears when running SeaTunnel in Spark local[*] mode? - -If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. 
Therefore, this parameter generally does not need to be set during On YARN. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details. - -## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel? - -Place the Jar package under the specified structure of the plugins directory: - +To run SeaTunnel in Zeta Local mode, use the following command: ```bash -cd SeaTunnel -mkdir -p plugins/my_plugins/lib -cp third-part.jar plugins/my_plugins/lib +$SEATUNNEL_HOME/bin/seatunnel.sh \ +-c $SEATUNNEL_HOME/config/your_app.conf \ +-m local[2] \ +-i city=shanghai \ +-i date=20231110 ``` +Use the `-i` or `--variable` parameter followed by `key=value` to specify the variable's value, ensuring that `key` matches the variable name in the configuration. For more details, refer to: https://seatunnel.apache.org/docs/concept/config Review Comment: ```suggestion Use the `-i` or `--variable` parameter followed by `key=value` to specify the variable's value, ensuring that `key` matches the variable name in the configuration. For more details, refer to: https://seatunnel.apache.org/docs/concept/config/#config-variable-substitution ``` ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. Review Comment: The answer is not always no, it depends on the engine used by the user. If you use Zeta, you need to install the engine, just like Flink/Spark. 
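The `${city}`/`${date}` placeholders discussed in the variable-declaration FAQ above behave like plain `${name}` string substitution. As an illustration only (this is not SeaTunnel code), Python's `string.Template` happens to use the same `${name}` syntax, so it can sketch what the `-i key=value` flags do to the config text:

```python
# Illustrative sketch, not SeaTunnel internals: the -i key=value flags supply
# values that replace ${name} placeholders in the config before the job runs.
from string import Template

query = Template("select * from user_view where city = '${city}' and dt = '${date}'")

# Equivalent of: -i city=shanghai -i date=20231110
variables = {"city": "shanghai", "date": "20231110"}

print(query.substitute(variables))
# → select * from user_view where city = 'shanghai' and dt = '20231110'
```

Note that `substitute` raises `KeyError` when a placeholder has no value, which mirrors why every variable used in the config must be supplied on the command line.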
########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. 
If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. Review Comment: This part already had in https://seatunnel.apache.org/docs/2.3.8/connector-v2/source/MySQL-CDC#enabling-the-mysql-binlog ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? 
+SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. 
+## What permissions are required for MySQL CDC synchronization and how to enable them? +You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: Review Comment: ditto ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? 
+## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? 
- -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? Review Comment: ditto. ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. 
You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? 
+You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: +1. Check if the SQL Server CDC Agent is running: ``` -var = """ -your string 1 -"""${you_var}""" your string 2""" +EXEC xp_servicecontrol N'querystate', N'SQLServerAGENT'; +-- If the result is "running," it means the agent is enabled. Otherwise, it needs to be started manually. 
``` -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). - -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? - -Of course! See the screenshot below: - - - - - -## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time? - +2. If using Linux, enable the SQL Server CDC Agent: ``` -env { - ... -} - -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} - -transform { - ... -} - -sink { - elasticsearch { ... } -} +/opt/mssql/bin/mssql-conf setup +The result that is returned is as follows: +1) Evaluation (free, no production use rights, 180-day limit) +2) Developer (free, no production use rights) +3) Express (free) +4) Web (PAID) +5) Standard (PAID) +6) Enterprise (PAID) +7) Enterprise Core (PAID) +8) I bought a license through a retail sales channel and have a product key to enter. ``` - -## Are there any HBase plugins? - -There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase . - -## How can I use SeaTunnel to write data to Hive? - +Choose the appropriate option based on your situation. +Select option 2 (Developer) for a free version that includes the agent. Enable the agent by running: ``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. - stdout { - limit = 1 - } -} +/opt/mssql/bin/mssql-conf set sqlagent.enabled true ``` -In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910. 
-
-## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing?
-
-1. Write distributed tables directly (not recommended)
-
-2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck-proxy.xx.xx:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-3. Configure multiple instances in the configuration:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck1:8123,ck2:8123,ck3:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-4. Use cluster mode:
-
-   ```
-   {
-       output {
-           clickhouse {
-               # Configure only one host
-               host = "ck1:8123"
-               cluster = "clickhouse_cluster_name"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-
-## How can I solve OOM when SeaTunnel consumes Kafka?
-
-In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows:
-
-For the current limit of Spark consumption of Kafka:
-
-1. Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N.
-
-2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform.
-
-3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M.
-
-The following conclusions can be drawn:
-
-1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M
+If using Windows, enable SQL Server Agent (e.g., for SQL Server 2008):
+  - Refer to the [official documentation](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms191454(v=sql.105)).
+```
+Open "SQL Server Configuration Manager" from the Start menu, navigate to "SQL Server Services," right-click the "SQL Server Agent" instance, and start it.
+```
-2. When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N
+3. Firstly, enable CDC at the database level:
```
+USE TestDB; -- Replace with your actual database name
+EXEC sys.sp_cdc_enable_db;
-3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption.
+
+-- Check if the database has CDC enabled
+SELECT name, is_cdc_enabled
+FROM sys.databases
+WHERE name = 'database'; -- Replace with the name of your database
+```
-
+4. Secondly, enable CDC at the table level:
```
+USE TestDB; -- Replace with your actual database name
+EXEC sys.sp_cdc_enable_table
+@source_schema = 'dbo',
+@source_name = 'table', -- Replace with the table name
+@role_name = NULL,
+@capture_instance = 'table'; -- Replace with a unique capture instance name
-## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`?
+
+-- Check if the table has CDC enabled
+SELECT name, is_tracked_by_cdc
+FROM sys.tables
+WHERE name = 'table'; -- Replace with the table name
+```
-The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version.
+## Does SeaTunnel support CDC synchronization for tables without primary keys?
+No, CDC synchronization is not supported for tables without primary keys. This is because, if there are two identical rows upstream and one is deleted or modified, it would be impossible to distinguish which row should be deleted or modified downstream, potentially resulting in both rows being affected.

-## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8?
+## Error during PostgreSQL task execution: Caused by: org.postgresql.util.PSQLException: ERROR: all replication slots are in use

Review Comment:
   It's strange to put connector related questions here, why not put them on the connector's own page.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
