[ https://issues.apache.org/jira/browse/KYLIN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617361#comment-16617361 ]

ASF subversion and git services commented on KYLIN-3554:
--------------------------------------------------------

Commit 66e55ddc6432996ba7a334f36c9377b8d24d4c4f in kylin's branch 
refs/heads/2.5.0-hadoop3.1 from shaofengshi
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=66e55dd ]

KYLIN-3554 Kylin move to next step on Spark job resumed


> Spark job failed but Yarn shows SUCCEED, causing Kylin move to next step
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-3554
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3554
>             Project: Kylin
>          Issue Type: Bug
>          Components: Spark Engine
>    Affects Versions: v2.5.0
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>            Priority: Major
>             Fix For: v2.5.0
>
>
> In KYLIN-2998, we modified SparkExecutable to check the Spark job's status from 
> Yarn in "onExecuteStart", to avoid repeatedly submitting the job to Hadoop. 
> However, today I found it doesn't work. The Spark job failed:
> {code:java}
> 18/09/11 06:39:03 INFO execution.SparkSqlParser: Parsing command: 
> default.kylin_intermediate_kylin_sales_cube_clone_clone_ea7091c0_a057_b584_4ccf_2368ec5e9f55
> 18/09/11 06:39:03 INFO metastore.HiveMetaStore: 0: get_table : db=default 
> tbl=kylin_intermediate_kylin_sales_cube_clone_clone_ea7091c0_a057_b584_4ccf_2368ec5e9f55
> 18/09/11 06:39:03 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr 
> cmd=get_table : db=default 
> tbl=kylin_intermediate_kylin_sales_cube_clone_clone_ea7091c0_a057_b584_4ccf_2368ec5e9f55
> Exception in thread "main" java.lang.RuntimeException: error execute 
> org.apache.kylin.engine.spark.SparkFactDistinct. Root cause: Table or view 
> 'kylin_intermediate_kylin_sales_cube_clone_clone_ea7091c0_a057_b584_4ccf_2368ec5e9f55'
>  not found in database 'default';
> at 
> org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:42)
> at org.apache.kylin.common.util.SparkEntry.main(SparkEntry.java:44)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:744)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table 
> or view 
> 'kylin_intermediate_kylin_sales_cube_clone_clone_ea7091c0_a057_b584_4ccf_2368ec5e9f55'
>  not found in database 'default';
> at 
> org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:74)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:117)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:628)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:627)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:124)
> at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:70)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:586)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:582)
> at 
> org.apache.kylin.engine.spark.SparkUtil.hiveRecordInputRDD(SparkUtil.java:157)
> at 
> org.apache.kylin.engine.spark.SparkFactDistinct.execute(SparkFactDistinct.java:186)
> at 
> org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:37)
> ... 10 more
> {code}
> But on Yarn, this job shows as SUCCEEDED:
> {code:java}
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Application Report :
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Application-Id : application_1533021510235_1085
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Application-Name : Fact distinct columns for:kylin_sales_cube_clone_clone 
> segment ea7091c0-a057-b584-4ccf-2368ec5e9f55
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Application-Type : SPARK
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : User : 
> root
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : Queue : 
> default
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Start-Time : 1536647893696
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Finish-Time : 1536647943612
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : Progress 
> : 100%
> 2018-09-11 06:40:06,876 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : State : 
> FINISHED
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Final-State : SUCCEEDED
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Tracking-URL : N/A
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : RPC Port 
> : 0
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : AM Host : 
> 10.1.2.137
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : Aggregate 
> Resource Allocation : 234749 MB-seconds, 140 vcore-seconds
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : Log 
> Aggregation Status : SUCCEEDED
> 2018-09-11 06:40:06,877 INFO [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-92] spark.SparkExecutable:38 : 
> Diagnostics :
> {code}
> So when the job resumed, Kylin thought the job had succeeded and moved ahead. But in 
> the next step, it detected that no file had been generated by the previous step:
> {code:java}
> 2018-09-11 06:40:12,221 ERROR [Scheduler 1854232624 Job 
> a3e39298-8dc3-21f2-cf16-0aa5e451c777-100] steps.SaveStatisticsStep:149 : fail 
> to save cuboid statistics
> java.io.IOException: fail to find the statistics file in base dir: 
> /kylin/kylin_default_instance/kylin-a3e39298-8dc3-21f2-cf16-0aa5e451c777/kylin_sales_cube_clone_clone/fact_distinct_columns/statistics
> at 
> org.apache.kylin.engine.mr.steps.SaveStatisticsStep.doWork(SaveStatisticsStep.java:78)
> at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:163)
> at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:69)
> at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:163)
> at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:113)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
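The failure mode above boils down to this: on resume, SparkExecutable trusts YARN's final application status alone, but YARN can report SUCCEEDED even though the Spark driver threw an exception. A minimal, self-contained sketch of a more defensive resume check follows; the enum and method names are illustrative only, not Kylin's actual API:

```java
// Sketch of a resume check that does not trust YARN's final status alone.
// FinalStatus mirrors YARN's FinalApplicationStatus values; safeToMoveOn
// and outputExists are hypothetical names, not Kylin's real API.
public class ResumeCheckSketch {
    enum FinalStatus { SUCCEEDED, FAILED, KILLED, UNDEFINED }

    // YARN may report SUCCEEDED even when the Spark driver failed, so a
    // resumed step should also verify that its expected output exists
    // (e.g. the fact-distinct statistics directory from the log above)
    // before Kylin moves to the next step.
    static boolean safeToMoveOn(FinalStatus yarnStatus, boolean outputExists) {
        return yarnStatus == FinalStatus.SUCCEEDED && outputExists;
    }

    public static void main(String[] args) {
        // The scenario in this issue: YARN says SUCCEEDED, but no
        // statistics file was produced, so the step must not be skipped.
        System.out.println(safeToMoveOn(FinalStatus.SUCCEEDED, false)); // false
        System.out.println(safeToMoveOn(FinalStatus.SUCCEEDED, true));  // true
    }
}
```

In a real fix, the output check would be an HDFS `FileSystem.exists()` call against the step's configured output path, and a missing output would cause the step to be resubmitted rather than marked complete.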



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
