Hisoka-X commented on code in PR #8011: URL: https://github.com/apache/seatunnel/pull/8011#discussion_r1836605866
########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. 
If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? +You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. 
- -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: +1. Check if the SQL Server CDC Agent is running: ``` -var = """ -your string 1 -"""${you_var}""" your string 2""" +EXEC xp_servicecontrol N'querystate', N'SQLServerAGENT'; +-- If the result is "running," it means the agent is enabled. Otherwise, it needs to be started manually. ``` -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). - -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? - -Of course! See the screenshot below: - - - - - -## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time? - +2. If using Linux, enable the SQL Server CDC Agent: ``` -env { - ... -} - -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} - -transform { - ... -} - -sink { - elasticsearch { ... } -} +/opt/mssql/bin/mssql-conf setup +The result that is returned is as follows: +1) Evaluation (free, no production use rights, 180-day limit) +2) Developer (free, no production use rights) +3) Express (free) +4) Web (PAID) +5) Standard (PAID) +6) Enterprise (PAID) +7) Enterprise Core (PAID) +8) I bought a license through a retail sales channel and have a product key to enter. ``` - -## Are there any HBase plugins? 
- -There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase . - -## How can I use SeaTunnel to write data to Hive? - +Choose the appropriate option based on your situation. +Select option 2 (Developer) for a free version that includes the agent. Enable the agent by running: ``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. - stdout { - limit = 1 - } -} +/opt/mssql/bin/mssql-conf set sqlagent.enabled true ``` -In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910. - -## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing? - -1. Write distributed tables directly (not recommended) - -2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse: - - ``` - { - output { - clickhouse { - host = "ck-proxy.xx.xx:8123" - # Local table - table = "table_name" - } - } - } - ``` -3. Configure multiple instances in the configuration: - - ``` - { - output { - clickhouse { - host = "ck1:8123,ck2:8123,ck3:8123" - # Local table - table = "table_name" - } - } - } - ``` -4. Use cluster mode: - - ``` - { - output { - clickhouse { - # Configure only one host - host = "ck1:8123" - cluster = "clickhouse_cluster_name" - # Local table - table = "table_name" - } - } - } - ``` - -## How can I solve OOM when SeaTunnel consumes Kafka? - -In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows: - -For the current limit of Spark consumption of Kafka: - -1. 
Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N. - -2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform. - -3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M. - -The following conclusions can be drawn: - -1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M +If using Windows, enable SQL Server Agent (e.g., for SQL Server 2008): + - Refer to the [official documentation](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms191454(v=sql.105)). +``` +Open "SQL Server Configuration Manager" from the Start menu, navigate to "SQL Server Services," right-click the "SQL Server Agent" instance, and start it. +``` -2. When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N +3. Firstly, enable CDC at the database level: +``` +USE TestDB; -- Replace with your actual database name +EXEC sys.sp_cdc_enable_db; -3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption. +-- Check if the database has CDC enabled +SELECT name, is_cdc_enabled +FROM sys.databases +WHERE name = 'database'; -- Replace with the name of your database +``` - +4. 
Secondly, enable CDC at the table level: +``` +USE TestDB; -- Replace with your actual database name +EXEC sys.sp_cdc_enable_table +@source_schema = 'dbo', +@source_name = 'table', -- Replace with the table name +@role_name = NULL, +@capture_instance = 'table'; -- Replace with a unique capture instance name -## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`? +-- Check if the table has CDC enabled +SELECT name, is_tracked_by_cdc +FROM sys.tables +WHERE name = 'table'; -- Replace with the table name +``` -The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version. +## Does SeaTunnel support CDC synchronization for tables without primary keys? +No, CDC synchronization is not supported for tables without primary keys. This is because, if there are two identical rows upstream and one is deleted or modified, it would be impossible to distinguish which row should be deleted or modified downstream, potentially resulting in both rows being affected. -## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8? +## Error during PostgreSQL task execution: Caused by: org.postgresql.util.PSQLException: ERROR: all replication slots are in use +This error occurs when the replication slots in PostgreSQL are full and need to be released. 
Modify the `postgresql.conf` file to increase `max_wal_senders` and `max_replication_slots`, then restart the PostgreSQL service using the command: +``` +systemctl restart postgresql +``` +Example configuration: +``` +max_wal_senders = 1000 # max number of walsender processes +max_replication_slots = 1000 # max number of replication slots +``` -In SeaTunnel's config file, specify the following configuration: +## What should I do if I have a problem that I can't solve on my own? +If you encounter an issue while using SeaTunnel that you cannot resolve, you can: +1. Search the [issue list](https://github.com/apache/seatunnel/issues) or [mailing list](https://lists.apache.org/[email protected]) to see if someone else has asked the same question and received an answer. +2. If you can't find an answer, reach out to the community for help using [these methods](https://github.com/apache/seatunnel#contact-us). -```shell -spark { - ... - spark.executorEnv.JAVA_HOME="/your/java_8_home/directory" - spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory" - ... +## How do I declare variables? +Do you want to know how to declare a variable in a SeaTunnel configuration and dynamically replace its value at runtime? This feature is often used in both scheduled and non-scheduled offline processing as a placeholder for variables such as time and date. Here’s how to do it: +Declare a variable name in the configuration. Below is an example of a SQL transformation (in fact, any value in `key = value` format can use variable substitution): +``` +... +transform { + Sql { + query = "select * from user_view where city ='${city}' and dt = '${date}'" + } } +... ``` - -## What should I do if OOM always appears when running SeaTunnel in Spark local[*] mode? - -If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. 
Therefore, this parameter generally does not need to be set during On YARN. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details. - -## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel? - -Place the Jar package under the specified structure of the plugins directory: - +To run SeaTunnel in Zeta Local mode, use the following command: ```bash -cd SeaTunnel -mkdir -p plugins/my_plugins/lib -cp third-part.jar plugins/my_plugins/lib +$SEATUNNEL_HOME/bin/seatunnel.sh \ +-c $SEATUNNEL_HOME/config/your_app.conf \ +-m local[2] \ +-i city=shanghai \ +-i date=20231110 ``` +Use the `-i` or `--variable` parameter followed by `key=value` to specify the variable's value, ensuring that `key` matches the variable name in the configuration. For more details, refer to: https://seatunnel.apache.org/docs/concept/config Review Comment: ```suggestion Use the `-i` or `--variable` parameter followed by `key=value` to specify the variable's value, ensuring that `key` matches the variable name in the configuration. For more details, refer to: https://seatunnel.apache.org/docs/concept/config/#config-variable-substitution ``` ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. Review Comment: The answer is not always no, it depends on the engine used by the user. If you use Zeta, you need to install the engine, just like Flink/Spark. 
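The `${city}`/`${date}` placeholders discussed in the variable-declaration FAQ above behave like plain `${name}` string substitution. As an illustration only (this is not SeaTunnel code), Python's `string.Template` happens to use the same `${name}` syntax, so it can sketch what the `-i key=value` flags do to the config text:

```python
# Illustrative sketch, not SeaTunnel internals: the -i key=value flags supply
# values that replace ${name} placeholders in the config before the job runs.
from string import Template

query = Template("select * from user_view where city = '${city}' and dt = '${date}'")

# Equivalent of: -i city=shanghai -i date=20231110
variables = {"city": "shanghai", "date": "20231110"}

print(query.substitute(variables))
# → select * from user_view where city = 'shanghai' and dt = '20231110'
```

Note that `substitute` raises `KeyError` when a placeholder has no value, which mirrors why every variable used in the config must be supplied on the command line.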
########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. 
If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. Review Comment: This part already had in https://seatunnel.apache.org/docs/2.3.8/connector-v2/source/MySQL-CDC#enabling-the-mysql-binlog ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? 
+SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. 
+## What permissions are required for MySQL CDC synchronization and how to enable them? +You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: Review Comment: ditto ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? 
+## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? 
- -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? Review Comment: ditto. ########## docs/en/faq.md: ########## @@ -1,332 +1,169 @@ -# FAQs +# Frequently Asked Questions -## Why should I install a computing engine like Spark or Flink? +## Do I need to install engines like Spark or Flink to use SeaTunnel? +No, SeaTunnel supports Zeta, Spark, and Flink as options for the integration engine. You can choose one of them. The community especially recommends using Zeta, a new-generation high-performance engine specifically built for integration scenarios. +The community provides the most support for Zeta, which also has richer features. -SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary. +## What data sources and destinations does SeaTunnel support? +SeaTunnel supports a variety of data sources and destinations. 
You can find the detailed list on the official website: +- Supported data sources (Source): https://seatunnel.apache.org/docs/connector-v2/source +- Supported data destinations (Sink): https://seatunnel.apache.org/docs/connector-v2/sink -## I have a question, and I cannot solve it by myself +## Which data sources currently support CDC (Change Data Capture)? +Currently, CDC is supported for MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, etc. For more details, refer to the [Source](https://seatunnel.apache.org/docs/connector-v2/source) documentation. -I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/[email protected]) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us). - -## How do I declare a variable? - -Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime? - -Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows: - -Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution): +## Does it support CDC from MySQL replica? How is the log fetched? +Yes, it is supported by subscribing to the MySQL binlog and parsing the binlog on the synchronization server. +## What permissions are required for MySQL CDC synchronization and how to enable them? 
+You need `SELECT` permission on the relevant databases and tables. +1. The authorization statement is as follows: ``` -... -transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" - } -} -... +GRANT SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'username'@'host' IDENTIFIED BY 'password'; +FLUSH PRIVILEGES; ``` -Taking Spark Local mode as an example, the startup command is as follows: - -```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ --m local[2] \ --i city=shanghai \ --i date=20190319 +2. Edit `/etc/mysql/my.cnf` and add the following lines: +``` +[mysqld] +log-bin=/var/log/mysql/mysql-bin.log +expire_logs_days = 7 +binlog_format = ROW +binlog_row_image=full ``` -You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration. - -## How do I write a configuration item in multi-line text in the configuration file? - -When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end: - +3. Restart the MySQL service: ``` -var = """ - whatever you want -""" +service mysql restart ``` -## How do I implement variable substitution for multi-line text? - -It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks: +## What permissions are required for SQL Server CDC synchronization and how to enable them? +Using SQL Server CDC as a data source requires enabling the MS-CDC feature in SQL Server. The steps are as follows: +1. Check if the SQL Server CDC Agent is running: ``` -var = """ -your string 1 -"""${you_var}""" your string 2""" +EXEC xp_servicecontrol N'querystate', N'SQLServerAGENT'; +-- If the result is "running," it means the agent is enabled. Otherwise, it needs to be started manually. 
``` -Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456). - -## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler? - -Of course! See the screenshot below: - - - - - -## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time? - +2. If using Linux, enable the SQL Server CDC Agent: ``` -env { - ... -} - -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} - -transform { - ... -} - -sink { - elasticsearch { ... } -} +/opt/mssql/bin/mssql-conf setup +The result that is returned is as follows: +1) Evaluation (free, no production use rights, 180-day limit) +2) Developer (free, no production use rights) +3) Express (free) +4) Web (PAID) +5) Standard (PAID) +6) Enterprise (PAID) +7) Enterprise Core (PAID) +8) I bought a license through a retail sales channel and have a product key to enter. ``` - -## Are there any HBase plugins? - -There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase . - -## How can I use SeaTunnel to write data to Hive? - +Choose the appropriate option based on your situation. +Select option 2 (Developer) for a free version that includes the agent. Enable the agent by running: ``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. - stdout { - limit = 1 - } -} +/opt/mssql/bin/mssql-conf set sqlagent.enabled true ``` -In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910. 
-
-## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing?
-
-1. Write distributed tables directly (not recommended)
-
-2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck-proxy.xx.xx:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-3. Configure multiple instances in the configuration:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck1:8123,ck2:8123,ck3:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-4. Use cluster mode:
-
-   ```
-   {
-       output {
-           clickhouse {
-               # Configure only one host
-               host = "ck1:8123"
-               cluster = "clickhouse_cluster_name"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-
-## How can I solve OOM when SeaTunnel consumes Kafka?
-
-In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows:
-
-For the current limit of Spark consumption of Kafka:
-
-1. Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N.
-
-2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform.
-
-3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M.
-
-The following conclusions can be drawn:
-
-1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M
+If using Windows, enable SQL Server Agent (e.g., for SQL Server 2008):
+  - Refer to the [official documentation](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms191454(v=sql.105)).
+```
+Open "SQL Server Configuration Manager" from the Start menu, navigate to "SQL Server Services," right-click the "SQL Server Agent" instance, and start it.
+```
-2. When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N
+3. Firstly, enable CDC at the database level:
```
+USE TestDB; -- Replace with your actual database name
+EXEC sys.sp_cdc_enable_db;
-3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption.
+
+-- Check if the database has CDC enabled
+SELECT name, is_cdc_enabled
+FROM sys.databases
+WHERE name = 'database'; -- Replace with the name of your database
+```
-
+4. Secondly, enable CDC at the table level:
```
+USE TestDB; -- Replace with your actual database name
+EXEC sys.sp_cdc_enable_table
+@source_schema = 'dbo',
+@source_name = 'table', -- Replace with the table name
+@role_name = NULL,
+@capture_instance = 'table'; -- Replace with a unique capture instance name
-## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`?
+
+-- Check if the table has CDC enabled
+SELECT name, is_tracked_by_cdc
+FROM sys.tables
+WHERE name = 'table'; -- Replace with the table name
+```
-The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version.
+## Does SeaTunnel support CDC synchronization for tables without primary keys?
+No, CDC synchronization is not supported for tables without primary keys. This is because, if there are two identical rows upstream and one is deleted or modified, it would be impossible to distinguish which row should be deleted or modified downstream, potentially resulting in both rows being affected.

-## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8?
+## Error during PostgreSQL task execution: Caused by: org.postgresql.util.PSQLException: ERROR: all replication slots are in use

Review Comment:
   It's strange to put connector related questions here, why not put them on the connector's own page.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
