HuangFru opened a new issue, #8332:
URL: https://github.com/apache/hudi/issues/8332

   **Describe the problem you faced**
   
   I'm running a simple write performance test for Hudi with Spark on YARN, but my executors keep dying from OOM, and the `insert overwrite` SQL is very slow.
   I've created a table like this:
   ```
   create table lineitem_kp_mor (
      l_orderkey bigint,
      l_partkey bigint,
      l_suppkey bigint,
      l_linenumber int,
      l_quantity double,
      l_extendedprice double,
      l_discount double,
      l_tax double,
      l_returnflag string,
      l_linestatus string,
      l_shipdate date,
      l_receiptdate date,
      l_shipinstruct string,
      l_shipmode string,
      l_comment string,
      l_commitdate date
   ) USING hudi tblproperties (
        type = 'mor',
        primaryKey = 'l_orderkey,l_linenumber',
        hoodie.memory.merge.max.size = '2004857600000',
        hoodie.write.markers.type = 'DIRECT',
        hoodie.sql.insert.mode = 'non-strict'
   ) PARTITIONED BY (l_commitdate);
   ```
   This is the `lineitem` table from the TPC-H benchmark, with `l_commitdate` added as a partition key; after loading, the table has 2466 partitions.
   Then I write into the table through `insert overwrite ... select ... from` in Spark SQL. The data set is `sf100`: about 600 million rows, roughly 15 GB in storage.
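   Both the row count and the number of target partitions can be cross-checked against the source table with queries like these (illustrative only, not part of the original run; table name as used in the insert below):
   ```
   -- roughly 600 million rows expected
   select count(*) from tpch.sf100.lineitem;
   -- roughly 2466 distinct commit dates, i.e. target partitions
   select count(distinct l_commitdate) from tpch.sf100.lineitem;
   ```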
   
   Here's the command I submit to YARN:
   ```
   bin/beeline -u 'jdbc:hive2://localhost:10010/;#spark.master=yarn;spark.yarn.queue=default;spark.executor.instances=11;spark.executor.cores=5;spark.executor.memory=8g;spark.driver.memory=4g'
   ```
   Resources: 11 executors, each with 8 GB memory and 5 cores; driver memory 4 GB.
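   Purely for illustration, the same resources expressed as plain Spark launch options (the actual run goes through beeline/Kyuubi with the JDBC URL above) would look roughly like:
   ```
   # illustrative only; values mirror the settings in the JDBC URL
   spark-sql --master yarn --queue default \
     --num-executors 11 --executor-cores 5 \
     --executor-memory 8g --driver-memory 4g
   ```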
   
   The insert overwrite SQL:
   ```
   insert overwrite table lineitem_kp_mor
   select
     l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice,
     l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_receiptdate,
     l_shipinstruct, l_shipmode, l_comment, l_commitdate
   from tpch.sf100.lineitem;
   ```
   The SQL runs very slowly and eventually fails because executors die.
   Executors:
   
![image](https://user-images.githubusercontent.com/68625618/229009065-176aa657-e4ca-4dc4-a627-37ff2e7cbaeb.png)
   Stages:
   
![image](https://user-images.githubusercontent.com/68625618/229010340-c31e5728-2a26-4ef2-93c2-84e7da185f4b.png)
   Dead Executors' log:
   
![image](https://user-images.githubusercontent.com/68625618/229009250-0b2f2bf1-2b5f-42a4-b21c-c5f623c59fb0.png)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Prepare a TPC-H data set in Spark.
   2. Run Spark SQL on YARN, create the table, and run the insert overwrite as in my SQL above (sanity check sketched below).
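   If the load does complete, the resulting layout can be compared against the description above with something like this (illustrative; assumes Hudi's Spark SQL extensions support `show partitions` for this table):
   ```
   -- expect roughly 2466 partitions under l_commitdate
   show partitions lineitem_kp_mor;
   ```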
   
   **Expected behavior**
   
   I don't know if I've missed some important config or something. I'm new to Hudi. Could you please give me some help?
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.3
   
   * Hive version : 2.1.1
   
   * Hadoop version : 2.7.3
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```
   Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2490)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
        at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
        at org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:104)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:187)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:156)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
        at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:57)
        ... 36 more

        at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:112)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:96)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:113)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:87)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:89)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$1.run(ExecuteStatement.scala:125)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20230331101517327
        at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
        at org.apache.hudi.table.action.commit.SparkInsertOverwriteCommitActionExecutor.execute(SparkInsertOverwriteCommitActionExecutor.java:63)
        at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insertOverwrite(HoodieSparkCopyOnWriteTable.java:159)
        at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insertOverwrite(HoodieSparkCopyOnWriteTable.java:97)
        at org.apache.hudi.client.SparkRDDWriteClient.insertOverwrite(SparkRDDWriteClient.java:207)
        at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:215)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:306)
        at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:94)
        at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:47)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
        at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
        at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:94)
        ... 9 more
   ```
   
   

