chuxiao created KYLIN-4250:
------------------------------

             Summary: FetcherRunner should skip the job and process other jobs
instead of throwing an exception when part of the job's metadata is not found
                 Key: KYLIN-4250
                 URL: https://issues.apache.org/jira/browse/KYLIN-4250
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v3.0.0-alpha
            Reporter: chuxiao


problem:
Our cluster has two nodes (build1 and build2) building cube jobs, using
DistributedScheduler.
One job, id 9f05b84b-cec9-81ee-9336-5a419e451a55, was shown as building on the
build1 node.
The job displayed Error, but its first sub-task (creating the hive flat table)
displayed Ready, and the first task's yarn job could be seen running in the
yarn UI. After the yarn job succeeded, the job re-ran the first sub-task, again
and again.


log:
In the build1 log, the job status changes from Ready to Running, then the first
task's status changes from Ready to Running, then a job-update broadcast is
sent and received. About twenty seconds later, another job-update broadcast is
received.

After a few minutes, the first task completes, but the log shows the job status
changing from Error to Ready! Then the job status changes from Ready to
Running, the first task starts running again, and the above log repeats.


I suspected that another node had changed the job status. The build2 node's log
contains many exception entries saying there is no output for a different job,
id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:

{code:java}
2019-09-20 14:20:58,825 WARN  [pool-10-thread-1] threadpool.DefaultFetcherRunner:90 : Job Fetcher caught a exception
java.lang.IllegalArgumentException: there is no related output for job id:f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
        at org.apache.kylin.job.execution.ExecutableManager.getOutputDigest(ExecutableManager.java:184)
        at org.apache.kylin.job.impl.threadpool.DefaultFetcherRunner.run(DefaultFetcherRunner.java:67)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

In addition, every time build2 receives build1's job-update broadcast, about
twenty seconds later its log shows the first task's state being changed from
Running to Ready, followed by its own broadcast.

After restarting the build2 node, the "Job Fetcher caught a exception" log no
longer appeared, and job 9f05b84b-cec9-81ee-9336-5a419e451a55 executed
successfully.


analysis:
This is caused by a job metadata synchronization problem that triggers a job
scheduling bug: the build1 node tries to run the job, while the other build
node kills the job and changes its status to Error.

The build2 node likely has a metadata synchronization problem: the job with id
f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d exists in ExecutableDao's
executableDigestMap but not in ExecutableDao's executableOutputDigestMap. Each
time FetcherRunner iterates over the jobs, this job throws an exception and
fetchFailed is set to true.

{code:java}
// DefaultFetcherRunner.run():
// this call throws the exception above
final Output outputDigest = getExecutableManger().getOutputDigest(id);
...
        } catch (Throwable th) {
            fetchFailed = true; // this could happen when resource store is unavailable
            logger.warn("Job Fetcher caught a exception ", th);
        }
{code}
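
For illustration, here is a minimal, self-contained sketch of that failure mode
(hypothetical stand-in class and maps, not Kylin's actual
ExecutableDao/ExecutableManager code): the job id is present in the digest map
but missing from the output digest map, so the Guava Preconditions check throws
the same IllegalArgumentException seen in the stack trace above.

{code:java}
// Minimal sketch of the failure mode; stand-in maps, not Kylin's real classes.
import java.util.HashMap;
import java.util.Map;

import com.google.common.base.Preconditions;

public class OutputDigestSketch {

    // stand-ins for ExecutableDao's executableDigestMap / executableOutputDigestMap
    static final Map<String, String> executableDigestMap = new HashMap<>();
    static final Map<String, String> executableOutputDigestMap = new HashMap<>();

    // mirrors the shape of ExecutableManager.getOutputDigest(): a lookup guarded
    // by Preconditions.checkArgument, which throws IllegalArgumentException
    static String getOutputDigest(String jobId) {
        String output = executableOutputDigestMap.get(jobId);
        Preconditions.checkArgument(output != null, "there is no related output for job id:" + jobId);
        return output;
    }

    public static void main(String[] args) {
        // metadata sync went wrong: the job exists in one map but not the other
        executableDigestMap.put("f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d", "digest");

        for (String id : executableDigestMap.keySet()) {
            getOutputDigest(id); // throws IllegalArgumentException, as in the stack trace
        }
    }
}
{code}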


When build2 next processes the job that build1 is running, fetchFailed is true,
the job is not in build2's list of running jobs, and the job status is Running,
so FetcherRunner.jobStateCount() kills the job, sets the running task's status
to Ready, sets the job status to Error, and broadcasts the change.

{code:java}
// FetcherRunner.jobStateCount():
    protected void jobStateCount(String id) {
        final Output outputDigest = getExecutableManger().getOutputDigest(id);
        // logger.debug("Job id:" + id + " not runnable");
        if (outputDigest.getState() == ExecutableState.SUCCEED) {
            nSUCCEED++;
        } else if (outputDigest.getState() == ExecutableState.ERROR) {
            nError++;
        } else if (outputDigest.getState() == ExecutableState.DISCARDED) {
            nDiscarded++;
        } else if (outputDigest.getState() == ExecutableState.STOPPED) {
            nStopped++;
        } else {
            if (fetchFailed) {
                // the problematic branch: kills a job that may be
                // running on another Kylin node
                getExecutableManger().forceKillJob(id);
                nError++;
            } else {
                nOthers++;
            }
        }
    }
{code}
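
To make the effect of that branch concrete, below is a reduced, self-contained
sketch (hypothetical stand-ins, not Kylin classes) of the decision
jobStateCount() makes: once fetchFailed is true, any job that is not in a final
state gets force-killed, including a healthy job that is merely running on
another node.

{code:java}
// Reduced sketch of jobStateCount()'s decision; hypothetical, not Kylin code.
public class JobStateCountSketch {
    enum ExecutableState { READY, RUNNING, SUCCEED, ERROR, DISCARDED, STOPPED }

    static boolean fetchFailed = true; // poisoned by the unrelated broken job

    static String decide(ExecutableState state) {
        switch (state) {
        case SUCCEED:
        case ERROR:
        case DISCARDED:
        case STOPPED:
            return "count only";       // final states are just tallied
        default:
            // the branch highlighted above: a RUNNING job owned by build1
            // is force-killed here once fetchFailed has been set
            return fetchFailed ? "forceKillJob" : "count as other";
        }
    }

    public static void main(String[] args) {
        for (ExecutableState s : ExecutableState.values())
            System.out.println(s + " -> " + decide(s));
    }
}
{code}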


After the first task of the job runs successfully on build1, the task state is
already Ready and is left unchanged, while the job status is Error; since
executeResult returns success, the job status is changed back to Ready. A job
in Ready status does not release the ZooKeeper lock, so build1 keeps scheduling
the job to run, and build2 keeps killing it, again and again. The build job is
never able to run normally.
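
The resulting livelock can be pictured with a toy simulation (hypothetical, not
DistributedScheduler's real code): build1 re-schedules the job whenever it is
Ready because the ZooKeeper lock is only released on a final state, while
build2's fetcher keeps killing it.

{code:java}
// Toy simulation of the livelock; hypothetical, not Kylin's scheduler code.
public class LivelockSketch {
    enum State { READY, RUNNING, ERROR }

    public static void main(String[] args) {
        State job = State.READY;
        for (int round = 1; round <= 3; round++) {
            // build1: READY does not release the zk lock, so the job is scheduled again
            job = State.RUNNING;
            System.out.println("round " + round + ": build1 runs the job   -> " + job);

            // build2: fetchFailed is true and the job is not in its runningJobs,
            // so jobStateCount() force-kills it and marks it ERROR
            job = State.ERROR;
            System.out.println("round " + round + ": build2 kills the job  -> " + job);

            // build1: the task itself succeeded, so the job is set back to READY
            job = State.READY;
            System.out.println("round " + round + ": build1 resets the job -> " + job);
        }
    }
}
{code}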

solve:
There are two problems with FetcherRunner:
1. When FetcherRunner iterates over the jobs, if part of a job's metadata is
not found, an exception is thrown and the whole fetch fails. We can instead
skip that job and continue with the other jobs, as sketched below.

2. For DistributedScheduler, even if fetchFailed is true, the job is not in
runningJobs, and its status is Running, FetcherRunner should not kill the job,
because the job may be scheduled by another Kylin service.

This Jira solves problem 1; another Jira will solve problem 2.
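
As an illustration of fix 1, here is a minimal, self-contained sketch
(hypothetical names, not the actual patch): the per-job metadata lookup is
wrapped in its own try/catch, so a job with missing metadata is skipped and the
remaining jobs are still processed.

{code:java}
// Sketch of fix 1: skip a job whose metadata is missing; hypothetical code.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkipBrokenJobSketch {

    static final Map<String, String> outputDigests = new HashMap<>();

    static String getOutputDigest(String jobId) {
        String output = outputDigests.get(jobId);
        if (output == null)
            throw new IllegalArgumentException("there is no related output for job id:" + jobId);
        return output;
    }

    public static void main(String[] args) {
        outputDigests.put("job-ok", "SUCCEED");
        List<String> jobIds = Arrays.asList("job-ok", "f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d");

        for (String id : jobIds) {
            try {
                System.out.println(id + " -> " + getOutputDigest(id));
            } catch (IllegalArgumentException e) {
                // before the fix this would have set fetchFailed = true and
                // poisoned the whole fetch; now we just skip the broken job
                System.out.println("skip " + id + ": " + e.getMessage());
            }
        }
    }
}
{code}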


