[ https://issues.apache.org/jira/browse/KYLIN-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shao Feng Shi resolved KYLIN-5008.
----------------------------------
Resolution: Fixed
> Backend Spark job failed, but the corresponding job status is shown as finished in the WebUI
> -------------------------------------------------------------------------------------
>
> Key: KYLIN-5008
> URL: https://issues.apache.org/jira/browse/KYLIN-5008
> Project: Kylin
> Issue Type: Bug
> Affects Versions: v4.0.0-beta
> Reporter: ZHANGHONGJIA
> Assignee: Yaqian Zhang
> Priority: Major
> Fix For: v4.0.3
>
> Attachments: image-2021-06-10-16-46-35-919.png,
> image-2021-06-15-15-27-45-099.png, image-2021-06-15-15-52-10-118.png,
> image-2021-06-15-15-52-31-635.png, merge-job.log
>
>
> According to the log below, the Spark job failed because its containers were
> killed by YARN for exceeding memory limits, yet in the Kylin WebUI the status
> of the merge job is shown as finished. Moreover, the amount of data in the
> merged segment is about three times the actual amount of data. It seems that
> Kylin did not detect the failure of this merge job.
>
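> As an aside, the YARN message itself suggests the immediate workaround: raise
> the executor memory overhead. Assuming Kylin 4's usual convention of passing
> Spark settings through the kylin.engine.spark-conf.* prefix in
> kylin.properties, the tuning might look like the sketch below (values are
> illustrative only, not a recommendation):
>
> # kylin.properties -- hypothetical values, adjust to the cluster
> # Give each executor more off-heap headroom so YARN's physical-memory
> # check is not tripped during the merge.
> kylin.engine.spark-conf.spark.executor.memory=36g
> kylin.engine.spark-conf.spark.executor.memoryOverhead=8g
> # Older Spark versions use the deprecated spark.yarn.executor.memoryOverhead key.
>
> That only avoids the OOM, though; the bug tracked here is that the job status
> did not reflect the failure.
>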
> Here is the merge job log:
> ===============================================================
> at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 244 in stage 1108.0 failed 4 times, most recent failure: Lost task 244.3 in stage 1108.0 (TID 78736, r4200h1-app.travelsky.com, executor 109): ExecutorLostFailure (executor 109 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 39.0 GB of 36 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
> ... 34 more
> }
> RetryInfo{
> overrideConf : {spark.executor.memory=36618MB, spark.executor.memoryOverhead=7323MB},
> throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeMergeJob
> at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:92)
> at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted.
> at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate.updateLayout(BuildLayoutWithUpdate.java:70)
> at org.apache.kylin.engine.spark.job.CubeMergeJob.mergeSegments(CubeMergeJob.java:122)
> at org.apache.kylin.engine.spark.job.CubeMergeJob.doExecute(CubeMergeJob.java:82)
> at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:298)
> at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
> ... 4 more
> Caused by: org.apache.spark.SparkException: Job aborted.
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
> at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:567)
> at org.apache.kylin.engine.spark.storage.ParquetStorage.saveTo(ParquetStorage.scala:28)
> at org.apache.kylin.engine.spark.job.CubeMergeJob.saveAndUpdateCuboid(CubeMergeJob.java:171)
> at org.apache.kylin.engine.spark.job.CubeMergeJob.access$000(CubeMergeJob.java:59)
> at org.apache.kylin.engine.spark.job.CubeMergeJob$1.build(CubeMergeJob.java:118)
> at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:51)
> at org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 428 in stage 360.0 failed 4 times, most recent failure: Lost task 428.3 in stage 360.0 (TID 26130, umetrip40-hdp2.6-140.travelsky.com, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 48.4 GB of 46 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
> ... 34 more
> }
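>
> From the trace, each layout build runs as a Callable inside
> BuildLayoutWithUpdate on a thread pool, and the reported behaviour suggests a
> task failure was not propagated back to the step that sets the job status. A
> minimal sketch of the fail-fast pattern presumably needed here (hypothetical
> names, not the actual Kylin fix):
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> // Illustrative only: run the builds and rethrow any task failure so the
> // enclosing job is marked FAILED instead of FINISHED.
> public class FailFastLayoutRunner {
>     public static void runAll(List<Callable<Void>> builds) throws Exception {
>         ExecutorService pool = Executors.newFixedThreadPool(4);
>         try {
>             List<Future<Void>> futures = new ArrayList<>();
>             for (Callable<Void> build : builds) {
>                 futures.add(pool.submit(build));
>             }
>             for (Future<Void> f : futures) {
>                 try {
>                     f.get(); // blocks; throws ExecutionException if the task failed
>                 } catch (ExecutionException e) {
>                     // Propagate the Spark error instead of swallowing it.
>                     throw new RuntimeException("layout build failed", e.getCause());
>                 }
>             }
>         } finally {
>             pool.shutdownNow();
>         }
>     }
> }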
>
> The job monitor in the WebUI:
> !image-2021-06-10-16-46-35-919.png!
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)