chuxiao created KYLIN-4250:
------------------------------
Summary: FechRunnner should skip the job to process other jobs
instead of throwing exception when the job section metadata is not found
Key: KYLIN-4250
URL: https://issues.apache.org/jira/browse/KYLIN-4250
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v3.0.0-alpha
Reporter: chuxiao
problem:
Our cluster has two nodes (named build1, build2) building cube jobs, and used
DistributedScheduler.
There is a job, id 9f05b84b-cec9-81ee-9336-5a419e451a55, shown built on the
build1 node.
The job displays Error, but the first sub task creating hive flat table
display Ready, and can see the first task's yarn job running through yarn ui.
After the yarn job is successful, the job re-runs the first sub-task, again
and again.
log:
Looking at the build1 log, the status of this job is changed from ready to
running, then the first task status is ready to running, then the update job
information is broadcast, then the update job information broadcast is
received. But after twenty seconds, a broadcast of the updated job information
was received.
After a few minutes, the first task is completed, but the log shows that the
job status changed from Error to ready! Then the job status changed from ready
to running, the first task starts running again .... Repeat the above log.
I suspect that other nodes have changed the job status. Looking at the build2
node log, there are a lot of exception logs, about there is no output for
another job id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:
{code:java}
2019-09-20 14:20:58,825 WARN [pool-10-thread-1]
threadpool.DefaultFetcherRunner:90 : Job Fetcher caught a exception
java.lang.IllegalArgumentException: there is no related output for job
id:f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
at
com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
at
org.apache.kylin.job.execution.ExecutableManager.getOutputDigest(ExecutableManager.java:184)
at
org.apache.kylin.job.impl.threadpool.DefaultFetcherRunner.run(DefaultFetcherRunner.java:67)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
In addition, each build2 receives the broadcast of the build1 update the job
information, after twenty seconds, the log print changes the first task state
runinng to ready and broadcasts.
Restarting the build2 node, not printing the Job Fetcher caught a exception ,
and the job 9f05b84b-cec9-81ee-9336-5a419e451a55 was successfully executed.
analysis
This is due to a job metadata synchronization exception, which triggers a job
scheduling bug. Build1 node try to run the job, but another build node kills
the job and changes the job status to Error, causing problems.
The build2 node may have a metadata synchronization problem, the job with the
id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d exists in ExecutableDao's
executableDigestMap, and does not exist in ExecutableDao's
executableOutputDigestMap. Each time FetchRunner foreach the job, it throws an
exception and fetchFailed is set to true.
{code:java}
DefaultFetcherRunner:
//throw exception
final Output outputDigest = getExecutableManger().getOutputDigest(id);
.
.
.
} catch (Throwable th) {
fetchFailed = true; // this could happen when resource store is
unavailable
logger.warn("Job Fetcher caught a exception ", th);
}
{code}
When the build2 first processes the job that build1 is running, since
fetchFailed is true, the job is not in the list of running jobs in build2, the
job status is running, FetchRunner.jobStateCount() will kill the job, and set
the running task status to ready, set the job status to error, broadcast.
{code:java}
FetchRunner.jobStateCount():
protected void jobStateCount(String id) {
final Output outputDigest = getExecutableManger().getOutputDigest(id);
// logger.debug("Job id:" + id + " not runnable");
if (outputDigest.getState() == ExecutableState.SUCCEED) {
nSUCCEED++;
} else if (outputDigest.getState() == ExecutableState.ERROR) {
nError++;
} else if (outputDigest.getState() == ExecutableState.DISCARDED) {
nDiscarded++;
} else if (outputDigest.getState() == ExecutableState.STOPPED) {
nStopped++;
} else {
if (fetchFailed) {
//this code
getExecutableManger().forceKillJob(id);
nError++;
} else {
nOthers++;
}
}
}
{code}
After the first task of the job runs successfully on the build1, the task state
is ready without change, and the job status is error,and executeResult returns
successfully, then the job status is changed to ready. The job status Ready
will not release the zk lock, build1 will continue to schedule the job to run,
and then be killed by build2, again and again. The build job has not been able
to run normally
solve
There are two problems with FetcherRunner:
1. When FechRunnner foreach job, if the metadata of the job part is not found,
an exception will be thrown. We can skip this job and foreach other jobs.
2. For DistributedScheduler, even if FetchFailed is true, not in runningJobs,
the status is running, FetchRunner should not kill the job because the job may
be scheduler by another kylin service
This jira solves the problem 1, another jira will solves the problem 2
--
This message was sent by Atlassian Jira
(v8.3.4#803005)