chandu-1101 commented on issue #9329:
URL: https://github.com/apache/hudi/issues/9329#issuecomment-1677348085

   Hi,
   
   After painstakingly:
   
   1. Extracting the partition data (by created date) as JSON.
   2. Building the Parquet snapshot partitioned by created date.
   3. Creating the Hudi table on S3 from step 2, partitioned by created date -- this step fails.
   
   I am not sure what my next step could be!
   
   I tried executor memory of 3G, 6G, and 8G, with only 1 executor running per node and executor cores = 1; each node has 4 cores and 16GB RAM. The task keeps failing after writing for 30-40 minutes.
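   For reference, the resource flags I used looked roughly like this (reconstructed from memory; exact values varied across runs):
   
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 \
     --executor-cores 1 \
     --executor-memory 8G   # also tried 3G and 6G
   ```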
   
   Hudi version:
   ```
    --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1
   ```
   Spark version:
   ```
   3.3.0
   ```
   
   EMR release:
   ```
   6.9.0
   ```
   
   Spark shell code:
   ```scala
   val snapshotDf = sess.read.parquet("s3://bucket/snapshots2/ge11-partitioned/")
   snapshotDf.write.format("hudi")
     .options(getQuickstartWriteConfigs)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "cdc_pk")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_id.oid")
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "__created_date_")
     .option(HoodieWriteConfig.TABLE_NAME, "GE11")
     .mode(SaveMode.Overwrite)
     .save("s3://partitioned/snapshots2/ge11-hudi/")
   ```
   
   Ganglia snapshot:
   <img width="1270" alt="image" src="https://github.com/apache/hudi/assets/138012432/d2e0a0e3-099a-42f6-905e-97ba0e05c32c">
   
   Exception:
   ```
   
   23/08/14 13:36:05 WARN TaskSetManager: Lost task 1.3 in stage 3.0 (TID 970) 
(ip-172-25-26-247.prod.phenom.local executor 14): ExecutorLostFailure (executor 
14 exited caused by one of the running tasks) Reason: Container from a bad 
node: container_1692006436772_0006_01_000017 on host: 
ip-172-25-26-247.prod.phenom.local. Exit status: 137. Diagnostics: [2023-08-14 
13:36:05.150]Container killed on request. Exit code is 137
   [2023-08-14 13:36:05.150]Container exited with a non-zero exit code 137.
   [2023-08-14 13:36:05.150]Killed by external signal
   .
   23/08/14 13:36:05 ERROR TaskSetManager: Task 1 in stage 3.0 failed 4 times; 
aborting job
   org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
time 20230814132104020
     at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:75)
     at 
org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:44)
     at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:107)
     at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:96)
     at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:140)
     at 
org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
     at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:372)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
     at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
     at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
     at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
     at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:103)
     at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
     at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
     at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
     at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
     at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
     at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
     at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
     at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245)
     at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
     at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
     at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:100)
     at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:96)
     at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:615)
     at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177)
     at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:615)
     at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
     at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
     at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
     at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
     at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
     at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:591)
     at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:96)
     at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:83)
     at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:81)
     at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:124)
     at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
     at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
     at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
     ... 49 elided
   Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 
in stage 3.0 (TID 970) (ip-172-25-26-247.prod.phenom.local executor 14): 
ExecutorLostFailure (executor 14 exited caused by one of the running tasks) 
Reason: Container from a bad node: container_1692006436772_0006_01_000017 on 
host: ip-172-25-26-247.prod.phenom.local. Exit status: 137. Diagnostics: 
[2023-08-14 13:36:05.150]Container killed on request. Exit code is 137
   [2023-08-14 13:36:05.150]Container exited with a non-zero exit code 137.
   [2023-08-14 13:36:05.150]Killed by external signal
   .
   Driver stacktrace:
     at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
     at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
     at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
     at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
     at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)
     at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239)
     at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239)
     at scala.Option.foreach(Option.scala:407)
     at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239)
     at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051)
     at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993)
     at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2229)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2250)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2294)
     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
     at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
     at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
     at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:367)
     at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
     at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
     at 
org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:367)
     at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
     at 
org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:105)
     at 
org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:121)
     at 
org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:90)
     at 
org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:55)
     at 
org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:37)
     at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
     ... 91 more
   
   ```
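   Since exit code 137 means the YARN container was killed for exceeding its memory limit, and the stack trace shows the job dying inside the Bloom index lookup (`HoodieBloomIndex.tagLocation`), one thing I am considering trying next is switching this initial load to `bulk_insert` (which skips the index tagging step) and adding executor memory overhead. This is an untested sketch, and the overhead value is a guess:
   
   ```
   // Untested sketch: bulk_insert avoids the Bloom-index tagLocation step
   // where the job currently fails; suitable here since this is an initial
   // load into an empty table, not an incremental upsert.
   val snapshotDf = sess.read.parquet("s3://bucket/snapshots2/ge11-partitioned/")
   snapshotDf.write.format("hudi")
     .options(getQuickstartWriteConfigs)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "bulk_insert")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "cdc_pk")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_id.oid")
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "__created_date_")
     .option(HoodieWriteConfig.TABLE_NAME, "GE11")
     .mode(SaveMode.Overwrite)
     .save("s3://partitioned/snapshots2/ge11-hudi/")
   ```
   
   along with launching spark-shell with `--conf spark.executor.memoryOverhead=2g` (value is a guess) to give YARN some headroom before it kills the container.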
   
   

