[ 
https://issues.apache.org/jira/browse/KYLIN-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shao Feng Shi resolved KYLIN-5008.
----------------------------------
    Resolution: Fixed

> Backend Spark job failed, but corresponding job status is shown as finished 
> in WebUI 
> -------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5008
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5008
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: v4.0.0-beta
>            Reporter: ZHANGHONGJIA
>            Assignee: Yaqian Zhang
>            Priority: Major
>             Fix For: v4.0.3
>
>         Attachments: image-2021-06-10-16-46-35-919.png, 
> image-2021-06-15-15-27-45-099.png, image-2021-06-15-15-52-10-118.png, 
> image-2021-06-15-15-52-31-635.png, merge-job.log
>
>
> According to the log below, the Spark job failed because its container was 
> killed by YARN for exceeding memory limits, but in the Kylin WebUI the 
> status of the merge job is shown as finished. Besides, the amount of data in 
> the merged segment is about three times the actual amount of data. It seems 
> that Kylin did not detect the failure of this merge job.
>  
> Here is the merge job log :
> ===============================================================
>  at 
> org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 244 in stage 1108.0 failed 4 times, most recent failure: Lost task 244.3 
> in stage 1108.0 (TID 78736, r4200h1-app.travelsky.com, executor 109): 
> ExecutorLostFailure (executor 109 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 39.0 GB of 36 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead 
> or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>  ... 34 more
> }
> RetryInfo{
>  overrideConf : {spark.executor.memory=36618MB, 
> spark.executor.memoryOverhead=7323MB},
>  throwable : java.lang.RuntimeException: Error execute 
> org.apache.kylin.engine.spark.job.CubeMergeJob
>  at 
> org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:92)
>  at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: org.apache.spark.SparkException: Job 
> aborted.
>  at 
> org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate.updateLayout(BuildLayoutWithUpdate.java:70)
>  at 
> org.apache.kylin.engine.spark.job.CubeMergeJob.mergeSegments(CubeMergeJob.java:122)
>  at 
> org.apache.kylin.engine.spark.job.CubeMergeJob.doExecute(CubeMergeJob.java:82)
>  at 
> org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:298)
>  at 
> org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
>  ... 4 more
> Caused by: org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
>  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:567)
>  at 
> org.apache.kylin.engine.spark.storage.ParquetStorage.saveTo(ParquetStorage.scala:28)
>  at 
> org.apache.kylin.engine.spark.job.CubeMergeJob.saveAndUpdateCuboid(CubeMergeJob.java:171)
>  at 
> org.apache.kylin.engine.spark.job.CubeMergeJob.access$000(CubeMergeJob.java:59)
>  at 
> org.apache.kylin.engine.spark.job.CubeMergeJob$1.build(CubeMergeJob.java:118)
>  at 
> org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:51)
>  at 
> org.apache.kylin.engine.spark.job.BuildLayoutWithUpdate$1.call(BuildLayoutWithUpdate.java:43)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 428 in stage 360.0 failed 4 times, most recent failure: Lost task 428.3 
> in stage 360.0 (TID 26130, umetrip40-hdp2.6-140.travelsky.com, executor 1): 
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks) 
> Reason: Container killed by YARN for exceeding memory limits. 48.4 GB of 46 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead 
> or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
> Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>  ... 34 more
> }
>  
> The WebUI monitor:
> !image-2021-06-10-16-46-35-919.png!
>  
>  
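For reference, the "Container killed by YARN for exceeding memory limits" error in the log is driven by simple container-sizing arithmetic: YARN kills an executor once its physical memory use exceeds the requested container size, which Spark computes as executor memory plus memory overhead (by default max(384 MB, 10% of executor memory), per Spark's documentation). The sketch below plugs in the values from the RetryInfo block above; it is an illustration of the sizing formula, not code from Kylin or Spark.

```python
# Sketch of the YARN container sizing behind the
# "Container killed by YARN for exceeding memory limits" error.
# Spark requests: container_mb = executor_memory_mb + overhead_mb,
# where the default overhead is max(384 MB, 10% of executor memory).

def default_overhead_mb(executor_memory_mb):
    """Spark's default spark.executor.memoryOverhead, in MB."""
    return max(384, int(executor_memory_mb * 0.10))

def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total container size Spark asks YARN for, in MB."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb

# Values from the RetryInfo block above:
# spark.executor.memory=36618MB, spark.executor.memoryOverhead=7323MB
print(default_overhead_mb(36618))         # default overhead: 3661 MB
print(container_request_mb(36618, 7323))  # retried request: 43941 MB
```

When tasks still exceed the container limit, the error message's own advice applies: raise `spark.executor.memoryOverhead` (in Kylin 4 this would presumably be set through the Spark-conf override mechanism in kylin.properties; check your version's configuration reference for the exact property prefix) rather than only raising executor memory, since off-heap usage is what typically breaches the limit.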



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
