lyshyhuangli opened a new issue, #6920: URL: https://github.com/apache/seatunnel/issues/6920
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

I am using SeaTunnel in DolphinScheduler (version 3.1.2) to ingest MySQL data into ClickHouse:

1. One MySQL table holds about 140 million rows; a single full-volume sync of it to ClickHouse fails.
2. With the same configuration, syncing in batches of roughly 11 million rows per batch succeeds.

My questions:

1. For a one-shot full sync, if the MySQL table is too large for the memory allocated to Spark to hold all the data, can SeaTunnel move the data into ClickHouse piece by piece (the way Kettle does) instead of simply aborting the ClickHouse write?
2. If SeaTunnel can ingest piece by piece like Kettle, does the failure mean something in my configuration is wrong?

### SeaTunnel Version

2.3.0

### SeaTunnel Config

```conf
"env" : {
  "spark.app.name" : "ods_t3070220000029_000203_v5_all",
  "spark.executor.instances" : 2,
  "spark.executor.cores" : 6,
  "spark.executor.memory" : "10g",
  "spark.network.timeout" : 10000000,
  "spark.executor.heartbeatInterval" : 1000000,
  "spark.yarn.executor.memoryOverhead" : 1024,
  "spark.yarn.driver.memoryOverhead" : 1024,
  "spark.yarn.max.executor.failures" : 4,
  "spark.task.cpus" : 4
},
```

### Running Command

```shell
${SPARK_HOME}/bin/spark-submit \
  --class "org.apache.seatunnel.core.starter.spark.SeatunnelSpark" \
  --name "SeaTunnel" \
  --master "yarn" \
  --deploy-mode "client" \
  --jars "/opt/server/seatunnel/plugins/jdbc/lib/ali-phoenix-shaded-thin-client-5.2.5-HBase-2.x.jar,/opt/server/seatunnel/plugins/jdbc/lib/mysql-connector-java-8.0.27.jar,/opt/server/seatunnel/plugins/jdbc/lib/postgresql-42.4.3.jar,/opt/server/seatunnel/plugins/jdbc/lib/DmJdbcDriver18-8.1.2.141.jar,/opt/server/seatunnel/plugins/jdbc/lib/mssql-jdbc-9.2.1.jre8.jar,/opt/server/seatunnel/plugins/jdbc/lib/ojdbc8-12.2.0.1.jar,/opt/server/seatunnel/plugins/jdbc/lib/sqlite-jdbc-3.39.3.0.jar,/opt/server/seatunnel/plugins/jdbc/lib/db2jcc-db2jcc4.jar,/opt/server/seatunnel/plugins/jdbc/lib/tablestore-jdbc-5.13.9.jar,/opt/server/seatunnel/plugins/jdbc/lib/terajdbc4-17.20.00.12.jar,/opt/server/seatunnel/plugins/jdbc/lib/redshift-jdbc42-2.1.0.9.jar,/opt/server/seatunnel/lib/seatunnel-transforms-v2.jar,/opt/server/seatunnel/lib/hadoop-aws-3.1.4.jar,/opt/server/seatunnel/lib/aws-java-sdk-bundle-1.11.271.jar,/opt/server/seatunnel/lib/seatunnel-hadoop3-3.1.4-uber-2.3.0-2.11.12.jar,/opt/server/seatunnel/connectors/seatunnel/connector-clickhouse-2.3.0.jar,/opt/server/seatunnel/connectors/seatunnel/connector-jdbc-2.3.0.jar" \
  --conf "spark.executor.memory=10g" \
  --conf "spark.task.cpus=4" \
  --conf "spark.yarn.driver.memoryOverhead=1024" \
  --conf "spark.executor.heartbeatInterval=1000000" \
  --conf "spark.yarn.max.executor.failures=4" \
  --conf "spark.network.timeout=10000000" \
  --conf "spark.executor.cores=6" \
  --conf "spark.app.name=ods_t3070220000029_000203_v5_all" \
  --conf "spark.yarn.executor.memoryOverhead=1024" \
  --conf "spark.executor.instances=2" \
  /opt/server/seatunnel/starter/seatunnel-spark-starter.jar \
  --config "/tmp/dolphinscheduler/exec/process/dps/13065104481120/13124201617382_13/1100/1782/seatunnel_1100_1782.conf" \
  --master "yarn" \
  --deploy-mode "client"
```

### Error Exception

```log
24/04/02 13:34:19 ERROR v2.WriteToDataSourceV2Exec: Data source writer org.apache.seatunnel.translation.spark.sink.SparkDataSourceWriter@71d0b8a4 is aborting.
24/04/02 13:34:19 ERROR v2.WriteToDataSourceV2Exec: Data source writer org.apache.seatunnel.translation.spark.sink.SparkDataSourceWriter@71d0b8a4 aborted.
24/04/02 13:34:19 ERROR command.SparkApiTaskExecuteCommand: Run SeaTunnel on spark failed.
org.apache.spark.SparkException: Writing job aborted.
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:259)
	at org.apache.seatunnel.core.starter.spark.execution.SinkExecuteProcessor.execute(SinkExecuteProcessor.java:85)
	at org.apache.seatunnel.core.starter.spark.execution.SparkExecution.execute(SparkExecution.java:61)
	at org.apache.seatunnel.core.starter.spark.command.SparkApiTaskExecuteCommand.execute(SparkApiTaskExecuteCommand.java:55)
	at org.apache.seatunnel.core.starter.Seatunnel.run(Seatunnel.java:39)
	at org.apache.seatunnel.core.starter.spark.SeatunnelSpark.main(SeatunnelSpark.java:34)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 0.0 because task 0 (partition 0)
```

### Zeta or Flink or Spark Version

Spark 2.3

### Java or Scala Version

Java 1.8

### Screenshots

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
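For reference, the chunked transfer asked about above maps onto existing connector options: the SeaTunnel JDBC source can split a large table read into partitions via `partition_column`/`partition_num`, and the ClickHouse sink flushes rows in bounded batches controlled by `bulk_size`, so no single task has to hold the whole table in memory. A minimal sketch of the source/sink sections under those assumptions — host names, credentials, table and column names are placeholders, and option names should be verified against the 2.3.x connector docs:

```conf
source {
  Jdbc {
    # placeholder connection details
    url = "jdbc:mysql://mysql-host:3306/ods_db"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "placeholder_user"
    password = "placeholder_password"
    query = "select * from big_table"
    # split the read on a numeric column so Spark tasks each
    # scan one range instead of the whole 140M-row table
    partition_column = "id"
    partition_num = 100
  }
}

sink {
  Clickhouse {
    host = "clickhouse-host:8123"
    database = "ods"
    table = "big_table"
    username = "placeholder_user"
    password = "placeholder_password"
    # rows buffered per insert; smaller values reduce memory pressure
    bulk_size = 20000
  }
}
```

If a partitioned read like this still aborts, the actual task-level cause (the executor-side exception above the `Writing job aborted` frame, e.g. an OOM or a ClickHouse timeout) would be needed to diagnose further.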
