[
https://issues.apache.org/jira/browse/KYLIN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chuxiao updated KYLIN-4250:
---------------------------
Attachment: KYLIN-4250.master.001.patch
> FetcherRunner should skip the job and process the other jobs, instead of
> throwing an exception, when part of the job's metadata is not found
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-4250
> URL: https://issues.apache.org/jira/browse/KYLIN-4250
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: v3.0.0-alpha
> Reporter: chuxiao
> Priority: Critical
> Attachments: KYLIN-4250.master.001.patch
>
>
> problem:
> Our cluster has two nodes (named build1 and build2) that build cube jobs,
> using DistributedScheduler.
> A job with id 9f05b84b-cec9-81ee-9336-5a419e451a55 is shown building on the
> build1 node.
> The job displays Error, but its first sub-task, which creates the Hive flat
> table, displays Ready, and the first task's YARN job can be seen running in
> the YARN UI. After the YARN job succeeds, the job re-runs the first sub-task,
> again and again.
> log:
> In the build1 log, the status of this job changes from Ready to Running, then
> the first task's status changes from Ready to Running, then the updated job
> information is broadcast, and the broadcast is received. About twenty seconds
> later, another broadcast of updated job information is received.
> A few minutes later the first task completes, but the log shows the job
> status changing from Error to Ready! Then the job status changes from Ready
> to Running, the first task starts running again, and the log above repeats.
> I suspected that another node had changed the job status. In the build2 node
> log there are many exception logs saying there is no output for a different
> job, id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d:
> {code:java}
> 2019-09-20 14:20:58,825 WARN [pool-10-thread-1] threadpool.DefaultFetcherRunner:90 : Job Fetcher caught a exception
> java.lang.IllegalArgumentException: there is no related output for job id:f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
>     at org.apache.kylin.job.execution.ExecutableManager.getOutputDigest(ExecutableManager.java:184)
>     at org.apache.kylin.job.impl.threadpool.DefaultFetcherRunner.run(DefaultFetcherRunner.java:67)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
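> The IllegalArgumentException comes from a Guava Preconditions.checkArgument
> call in ExecutableManager.getOutputDigest when the output-digest map has no
> entry for the job id. A minimal sketch of the failure mode (hypothetical
> simplified maps, and a plain null check standing in for the Guava call so
> the example has no Guava dependency):
>
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class MissingOutputSketch {
>     // stand-ins for ExecutableDao's two digest maps
>     static final Map<String, String> executableDigestMap = new HashMap<>();
>     static final Map<String, String> executableOutputDigestMap = new HashMap<>();
>
>     static String getOutputDigest(String id) {
>         String output = executableOutputDigestMap.get(id);
>         // mirrors Preconditions.checkArgument(output != null, ...)
>         if (output == null) {
>             throw new IllegalArgumentException("there is no related output for job id:" + id);
>         }
>         return output;
>     }
>
>     public static void main(String[] args) {
>         // the job exists in one map, but its output entry was never synced
>         executableDigestMap.put("f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d", "job-digest");
>         try {
>             getOutputDigest("f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d");
>         } catch (IllegalArgumentException e) {
>             System.out.println("caught: " + e.getMessage());
>         }
>     }
> }
> {code}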
> In addition, every time build2 receives build1's broadcast of updated job
> information, about twenty seconds later its log shows the first task's state
> changing from Running to Ready, followed by its own broadcast.
> After restarting the build2 node, the "Job Fetcher caught a exception" log no
> longer appears, and job 9f05b84b-cec9-81ee-9336-5a419e451a55 executes
> successfully.
> analysis
> This is caused by a job-metadata synchronization problem that triggers a job
> scheduling bug: build1 tries to run the job, but the other build node kills
> it and changes the job status to Error.
> On the build2 node, metadata synchronization likely went wrong: the job with
> id f1b2024a-e6ed-3dd5-5a7d-7c267ead5f1d exists in ExecutableDao's
> executableDigestMap but not in its executableOutputDigestMap. Every time
> FetcherRunner iterates over the jobs, this job throws an exception and
> fetchFailed is set to true.
> {code:java}
> DefaultFetcherRunner:
>     // throws when the job's output metadata is missing
>     final Output outputDigest = getExecutableManger().getOutputDigest(id);
>     ...
> } catch (Throwable th) {
>     fetchFailed = true; // this could happen when resource store is unavailable
>     logger.warn("Job Fetcher caught a exception ", th);
> }
> {code}
> When build2 next processes the job that build1 is running, fetchFailed is
> true, the job is not in build2's list of running jobs, and the job status is
> Running, so FetcherRunner.jobStateCount() kills the job, sets the running
> task's status to Ready, sets the job status to Error, and broadcasts.
> {code:java}
> FetcherRunner.jobStateCount():
> protected void jobStateCount(String id) {
>     final Output outputDigest = getExecutableManger().getOutputDigest(id);
>     // logger.debug("Job id:" + id + " not runnable");
>     if (outputDigest.getState() == ExecutableState.SUCCEED) {
>         nSUCCEED++;
>     } else if (outputDigest.getState() == ExecutableState.ERROR) {
>         nError++;
>     } else if (outputDigest.getState() == ExecutableState.DISCARDED) {
>         nDiscarded++;
>     } else if (outputDigest.getState() == ExecutableState.STOPPED) {
>         nStopped++;
>     } else {
>         if (fetchFailed) {
>             // this is the problematic branch
>             getExecutableManger().forceKillJob(id);
>             nError++;
>         } else {
>             nOthers++;
>         }
>     }
> }
> {code}
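> Putting the two pieces together, the chain of events can be sketched as
> follows (a simplified model with hypothetical types and string states, not
> Kylin's real classes): one job with broken metadata sets fetchFailed, and a
> later, perfectly healthy Running job owned by another node is force-killed.
>
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.List;
> import java.util.Set;
>
> public class KillCascadeSketch {
>     static boolean fetchFailed = false;
>     static final List<String> killed = new ArrayList<>();
>
>     // stand-in for getOutputDigest(): jobs with partial metadata throw,
>     // healthy jobs report RUNNING
>     static String getState(String id) {
>         if (id.startsWith("broken-")) {
>             throw new IllegalArgumentException("there is no related output for job id:" + id);
>         }
>         return "RUNNING";
>     }
>
>     static void fetchPass(List<String> jobIds, Set<String> runningOnThisNode) {
>         for (String id : jobIds) {
>             try {
>                 String state = getState(id);
>                 // simplified jobStateCount(): once fetchFailed is set, a
>                 // RUNNING job this node does not own is force-killed
>                 if (state.equals("RUNNING") && !runningOnThisNode.contains(id) && fetchFailed) {
>                     killed.add(id);
>                 }
>             } catch (Throwable th) {
>                 fetchFailed = true; // one broken job poisons the rest of the pass
>             }
>         }
>     }
>
>     public static void main(String[] args) {
>         // build2 owns no jobs; the first job's output metadata is missing
>         fetchPass(Arrays.asList("broken-f1b2024a", "9f05b84b"), Collections.emptySet());
>         System.out.println(killed); // the healthy job scheduled on build1 gets killed
>     }
> }
> {code}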
> After the job's first task finishes successfully on build1, the task state is
> already Ready and unchanged, the job status is Error, and executeResult
> returns success, so the job status is changed back to Ready. A job in Ready
> status does not release the ZooKeeper lock, so build1 keeps scheduling the
> job, which build2 then kills again, over and over. The build job can never
> run to completion.
> solve
> There are two problems with FetcherRunner:
> 1. When FetcherRunner iterates over the jobs, if part of a job's metadata is
> not found, an exception is thrown. We should skip that job and continue with
> the other jobs.
> 2. With DistributedScheduler, even if fetchFailed is true, a Running job not
> in this node's runningJobs should not be killed by FetcherRunner, because the
> job may be scheduled by another Kylin service.
> This JIRA solves problem 1; another JIRA will solve problem 2.
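> The idea behind the fix for problem 1 can be sketched as follows
> (hypothetical simplified method names, not the actual patch): catch the
> metadata exception per job inside the fetch loop and continue, so a single
> job with partial metadata neither aborts the pass nor sets fetchFailed for
> every other job.
>
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> public class SkipBrokenJobSketch {
>     // stand-in for getOutputDigest(): throws for a job with partial metadata
>     static String getOutputDigest(String id) {
>         if (id.startsWith("broken-")) {
>             throw new IllegalArgumentException("there is no related output for job id:" + id);
>         }
>         return "READY";
>     }
>
>     // returns the ids that were actually processed in this pass
>     static List<String> fetchJobs(List<String> jobIds) {
>         List<String> processed = new ArrayList<>();
>         for (String id : jobIds) {
>             try {
>                 getOutputDigest(id); // may throw for a job with partial metadata
>                 processed.add(id);   // ... scheduling logic would go here
>             } catch (IllegalArgumentException e) {
>                 // skip only this job; the other jobs in the same pass still run
>                 continue;
>             }
>         }
>         return processed;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(fetchJobs(Arrays.asList("job-1", "broken-2", "job-3")));
>     }
> }
> {code}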
--
This message was sent by Atlassian Jira
(v8.3.4#803005)