[ 
https://issues.apache.org/jira/browse/KYLIN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chuxiao updated KYLIN-4250:
---------------------------
    Attachment: KYLIN-4250.master.001.patch

> FetcherRunner should skip the job and process other jobs instead of throwing 
> an exception when part of the job's metadata is not found
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-4250
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4250
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v3.0.0-alpha
>            Reporter: chuxiao
>            Priority: Critical
>         Attachments: KYLIN-4250.master.001.patch
>
>
> Problem:
> Our cluster has two nodes (build1 and build2) building cube jobs, using 
> DistributedScheduler.
> One job, id 9f05b84b-cec9-81ee-9336-5a419e451a55, was shown as built on the 
> build1 node.
> The job displayed Error, but its first sub-task, creating the hive flat 
> table, displayed Ready, and the first task's yarn job could be seen running 
> in the yarn UI. After the yarn job succeeded, the job re-ran the first 
> sub-task, again and again.
> Log:
> In the build1 log, the job status changes from Ready to Running, then the 
> first task's status changes from Ready to Running, then the job-update 
> broadcast is sent and received. But twenty seconds later, another broadcast 
> of updated job information was received.
> After a few minutes, the first task completed, but the log shows the job 
> status changing from Error to Ready! The job status then changed from Ready 
> to Running, the first task started running again, and the above log 
> repeated.
> I suspected that another node had changed the job status. In the build2 
> node log there are many exception logs saying there is no output for a 
> different job, id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:
> {code:java}
> 2019-09-20 14:20:58,825 WARN  [pool-10-thread-1] threadpool.DefaultFetcherRunner:90 : Job Fetcher caught a exception
> java.lang.IllegalArgumentException: there is no related output for job id:f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
>         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
>         at org.apache.kylin.job.execution.ExecutableManager.getOutputDigest(ExecutableManager.java:184)
>         at org.apache.kylin.job.impl.threadpool.DefaultFetcherRunner.run(DefaultFetcherRunner.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> In addition, every time build2 received build1's job-update broadcast, the 
> build2 log, twenty seconds later, printed a change of the first task's state 
> from Running to Ready and broadcast it.
> After restarting the build2 node, "Job Fetcher caught a exception" was no 
> longer printed, and job 9f05b84b-cec9-81ee-9336-5a419e451a55 executed 
> successfully.
> Analysis:
> This is a job metadata synchronization exception that triggers a job 
> scheduling bug: the build1 node tries to run the job, but another build node 
> kills it and changes the job status to Error.
> The build2 node apparently has a metadata synchronization problem: the job 
> with id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d exists in ExecutableDao's 
> executableDigestMap but not in its executableOutputDigestMap. Each time 
> FetcherRunner iterates over the jobs, this job throws an exception and 
> fetchFailed is set to true.
> {code:java}
> // DefaultFetcherRunner: getOutputDigest throws for the broken job
> final Output outputDigest = getExecutableManger().getOutputDigest(id);
> ...
>         } catch (Throwable th) {
>             fetchFailed = true; // this could happen when resource store is unavailable
>             logger.warn("Job Fetcher caught a exception ", th);
>         }
> {code}
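> The analysis above can be reduced to a minimal, self-contained sketch 
> (hypothetical names; a plain map stands in for ExecutableDao's 
> executableOutputDigestMap) showing how a single job that has a digest but 
> no output digest taints the whole fetch cycle:
> {code:java}
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> public class FetchFailedSketch {
>     // stands in for ExecutableDao's executableOutputDigestMap
>     static final Map<String, String> outputDigestMap = new HashMap<>();
>
>     // mimics ExecutableManager.getOutputDigest's Preconditions.checkArgument
>     static String getOutputDigest(String id) {
>         String out = outputDigestMap.get(id);
>         if (out == null) {
>             throw new IllegalArgumentException("there is no related output for job id:" + id);
>         }
>         return out;
>     }
>
>     public static void main(String[] args) {
>         outputDigestMap.put("job-ok", "RUNNING");
>         // "job-broken" exists in executableDigestMap but not in the
>         // output digest map, mimicking the sync problem on build2
>         List<String> allJobIds = Arrays.asList("job-broken", "job-ok");
>
>         boolean fetchFailed = false;
>         for (String id : allJobIds) {
>             try {
>                 getOutputDigest(id);
>             } catch (Throwable th) {
>                 fetchFailed = true; // one broken job flips the shared flag
>             }
>         }
>         // fetchFailed is now true for the whole cycle, although job-ok is fine
>         System.out.println("fetchFailed=" + fetchFailed);
>     }
> }
> {code}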
> When build2 then processes the job that build1 is running, fetchFailed is 
> true, the job is not in build2's list of running jobs, and the job status is 
> Running, so FetcherRunner.jobStateCount() kills the job, sets the running 
> task's status to Ready, sets the job status to Error, and broadcasts.
> {code:java}
> // FetcherRunner.jobStateCount():
>     protected void jobStateCount(String id) {
>         final Output outputDigest = getExecutableManger().getOutputDigest(id);
>         if (outputDigest.getState() == ExecutableState.SUCCEED) {
>             nSUCCEED++;
>         } else if (outputDigest.getState() == ExecutableState.ERROR) {
>             nError++;
>         } else if (outputDigest.getState() == ExecutableState.DISCARDED) {
>             nDiscarded++;
>         } else if (outputDigest.getState() == ExecutableState.STOPPED) {
>             nStopped++;
>         } else {
>             if (fetchFailed) {
>                 // the problematic path: kills a job another node may be running
>                 getExecutableManger().forceKillJob(id);
>                 nError++;
>             } else {
>                 nOthers++;
>             }
>         }
>     }
> {code}
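> The kill decision above can be isolated into a hedged sketch (hypothetical 
> names, not Kylin's actual code): the kill is decided purely from local 
> information, with no check of whether another scheduler node owns the job.
> {code:java}
> public class KillDecisionSketch {
>     // Hypothetical reduction of the else-branch above: a job whose state
>     // falls through to "other" is force-killed whenever the fetch cycle
>     // has already failed and the job is not tracked locally.
>     static boolean shouldForceKill(boolean fetchFailed, boolean inLocalRunningJobs,
>                                    String state) {
>         return "RUNNING".equals(state) && !inLocalRunningJobs && fetchFailed;
>     }
>
>     public static void main(String[] args) {
>         // build2's view of the job build1 is legitimately running:
>         // the fetch cycle failed (because of the unrelated broken job),
>         // the job is absent from build2's runningJobs, and it is RUNNING
>         System.out.println("kill=" + shouldForceKill(true, false, "RUNNING"));
>     }
> }
> {code}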
> After the job's first task finishes successfully on build1, the task state 
> stays Ready (unchanged), the job status is Error, and executeResult returns 
> success, so the job status is changed back to Ready. A job in Ready status 
> does not release the zk lock, so build1 keeps scheduling the job to run, and 
> build2 keeps killing it, again and again. The build job can never run 
> normally.
> Solution:
> There are two problems with FetcherRunner:
> 1. When FetcherRunner iterates over the jobs, it throws an exception if part 
> of a job's metadata is not found. It should instead skip that job and 
> continue iterating over the other jobs.
> 2. For DistributedScheduler, even if fetchFailed is true, the job is not in 
> runningJobs, and its status is Running, FetcherRunner should not kill the 
> job, because the job may be scheduled by another Kylin service.
> This jira solves problem 1; another jira will solve problem 2.
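> A minimal sketch of fix 1 (hypothetical names; the attached patch is 
> authoritative): do the per-job lookup inside its own try/catch, so a job 
> with missing metadata is logged and skipped while the loop keeps processing 
> the healthy jobs, instead of marking the whole fetch cycle as failed.
> {code:java}
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> public class SkipBrokenJobSketch {
>     // stands in for ExecutableDao's executableOutputDigestMap
>     static final Map<String, String> outputDigestMap = new HashMap<>();
>
>     static String getOutputDigest(String id) {
>         String out = outputDigestMap.get(id);
>         if (out == null) {
>             throw new IllegalArgumentException("there is no related output for job id:" + id);
>         }
>         return out;
>     }
>
>     public static void main(String[] args) {
>         outputDigestMap.put("job-ok", "RUNNING");
>         List<String> allJobIds = Arrays.asList("job-broken", "job-ok");
>
>         int processed = 0;
>         for (String id : allJobIds) {
>             final String state;
>             try {
>                 state = getOutputDigest(id);
>             } catch (IllegalArgumentException e) {
>                 // metadata for this job is missing: log and skip it,
>                 // do NOT fail the whole fetch cycle
>                 System.out.println("skip " + id + ": " + e.getMessage());
>                 continue;
>             }
>             if ("RUNNING".equals(state)) {
>                 processed++; // the healthy jobs are still handled
>             }
>         }
>         System.out.println("processed=" + processed);
>     }
> }
> {code}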



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
