This is an automated email from the ASF dual-hosted git repository.

lidongdai pushed a commit to branch davidzollo-patch-3
in repository https://gitbox.apache.org/repos/asf/seatunnel-website.git

commit 347ac32d788394dc06129cf6a90e625fd4c53064
Author: David Zollo <[email protected]>
AuthorDate: Sat Jun 24 11:58:30 2023 +0800

    Update faq.md
---
 versioned_docs/version-2.3.2/faq.md | 177 ------------------------------------
 1 file changed, 177 deletions(-)

diff --git a/versioned_docs/version-2.3.2/faq.md b/versioned_docs/version-2.3.2/faq.md
index 6903043e087..58945d3b02d 100644
--- a/versioned_docs/version-2.3.2/faq.md
+++ b/versioned_docs/version-2.3.2/faq.md
@@ -1,8 +1,5 @@
 # FAQs
 
-## Why should I install a computing engine like Spark or Flink?
-
-SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary.
 
 ## I have a question, and I cannot solve it by myself
 
@@ -61,13 +58,6 @@ your string 1
 
 Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456).
 
-## Is SeaTunnel supportted in Azkaban, Oozie, DolphinScheduler?
-
-Of course! See the screenshot below:
-
-
-
-
 
 ## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time?
 
@@ -91,117 +81,6 @@ sink {
 }
 ```
 
-## Are there any HBase plugins?
-
-There is an hbase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase .
-
-## How can I use SeaTunnel to write data to Hive?
-
-```
-env {
-  spark.sql.catalogImplementation = "hive"
-  spark.hadoop.hive.exec.dynamic.partition = "true"
-  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
-}
-
-source {
-  sql = "insert into ..."
-}
-
-sink {
-    // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work.
-    stdout {
-        limit = 1
-    }
-}
-```
-
-In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910.
-
-## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing?
-
-1. Write distributed tables directly (not recommended)
-
-2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck-proxy.xx.xx:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-3. Configure multiple instances in the configuration:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck1:8123,ck2:8123,ck3:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-4. Use cluster mode:
-
-   ```
-   {
-       output {
-           clickhouse {
-               # Configure only one host
-               host = "ck1:8123"
-               cluster = "clickhouse_cluster_name"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-
-## How can I solve OOM when SeaTunnel consumes Kafka?
-
-In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows:
-
-For the current limit of Spark consumption of Kafka:
-
-1. Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N.
-
-2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform.
-
-3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M.
-
-The following conclusions can be drawn:
-
-1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M
-
-2. When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N
-
-3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption.
-
-
-
-## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`?
-
-The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version.
-
-## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8?
-
-In SeaTunnel's config file, specify the following configuration:
-
-```shell
-spark {
- ...
- spark.executorEnv.JAVA_HOME="/your/java_8_home/directory"
- spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory"
- ...
-}
-```
 
 ## How do I specify a different JDK version for SeaTunnel on Yarn?
 
@@ -224,17 +103,6 @@ For example, if you want to set the JDK version to JDK8, there are two cases:
 
 If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. Therefore, this parameter generally does not need to be set during On Yarn. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details.
 
-## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel?
-
-Place the Jar package under the specified structure of the plugins directory:
-
-```bash
-cd SeaTunnel
-mkdir -p plugins/my_plugins/lib
-cp third-part.jar plugins/my_plugins/lib
-```
-
-`my_plugins` can be any string.
 
 ## How do I configure logging-related parameters in SeaTunnel-v1(Spark)?
 
@@ -298,50 +166,5 @@ http://spark.apache.org/docs/latest/configuration.html#configuring-logging
 
 https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01
 
-## Error when writing to ClickHouse: ClassCastException
-
-In SeaTunnel, the data type will not be actively converted. After the Input reads the data, the corresponding
-Schema. When writing ClickHouse, the field type needs to be strictly matched, and the mismatch needs to be resolved.
-
-Data conversion can be achieved through the following two plug-ins:
-
-1. Filter Convert plugin
-2. Filter Sql plugin
-
-Detailed data type conversion reference: [ClickHouse Data Type Check List](https://interestinglab.github.io/seatunnel-docs/#/en/configuration/output-plugins/Clickhouse?id=clickhouse-data-type-check-list)
-
-Refer to issue:[#488](https://github.com/apache/seatunnel/issues/488) [#382](https://github.com/apache/seatunnel/issues/382).
-
-## How does SeaTunnel access kerberos-authenticated HDFS, YARN, Hive and other resources?
-
-Please refer to: [#590](https://github.com/apache/seatunnel/issues/590).
-
-## How do I troubleshoot NoClassDefFoundError, ClassNotFoundException and other issues?
-
-There is a high probability that there are multiple different versions of the corresponding Jar package class loaded in the Java classpath, because of the conflict of the load order, not because the Jar is really missing. Modify this SeaTunnel startup command, adding the following parameters to the spark-submit submission section, and debug in detail through the output log.
-
-```
-spark-submit --verbose
-    ...
-    --conf 'spark.driver.extraJavaOptions=-verbose:class'
-    --conf 'spark.executor.extraJavaOptions=-verbose:class'
-    ...
-```
-
-## How do I use SeaTunnel to synchronize data across HDFS clusters?
-
-Just configure hdfs-site.xml properly. Refer to: https://www.cnblogs.com/suanec/p/7828139.html.
-
-## I want to learn the source code of SeaTunnel. Where should I start?
-
-SeaTunnel has a completely abstract and structured code implementation, and many people have chosen SeaTunnel As a way to learn Spark. You can learn the source code from the main program entry: Seatunnel.java
-
-## When SeaTunnel developers develop their own plugins, do they need to understand the SeaTunnel code? Should these plugins be integrated into the SeaTunnel project?
-
-The plugin developed by the developer has nothing to do with the SeaTunnel project and does not need to include your plugin code.
-
-The plugin can be completely independent from SeaTunnel project, so you can write it using Java, Scala, Maven, sbt, Gradle, or whatever you want. This is also the way we recommend developers to develop plugins.
 
-## When I import a project, the compiler has the exception "class not found `org.apache.seatunnel.shade.com.typesafe.config.Config`"
 
-Run `mvn install` first. In the `seatunnel-config/seatunnel-config-base` subproject, the package `com.typesafe.config` has been relocated to `org.apache.seatunnel.shade.com.typesafe.config` and installed to the maven local repository in the subproject `seatunnel-config/seatunnel-config-shade`.
