[
https://issues.apache.org/jira/browse/KYLIN-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865843#comment-17865843
]
pengfei.zhan edited comment on KYLIN-5857 at 7/15/24 3:12 AM:
--------------------------------------------------------------
h1. Root Cause
Job nodes have two roles: master and slave. The master node holds the project epoch. When the master triggers a task, it writes the task to the job lock table and records its own IP on the task. If the master node itself executes the job, there is no problem. However, if a slave node executes the task, Spark calls the KE API during execution to update the stage state. That request is routed to the master node based on the epoch, so when the master updates the task output it also writes its own IP into the task metadata, even though the task is actually running on the slave node.
h1. Dev Design
When a node updates the task output, it should write its own IP into the task metadata only if the task is being executed on that node. The KE API that Spark calls during task execution should no longer need to be routed based on the epoch.
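A minimal sketch of this rule (the class and helper names below, e.g. `isExecutingOnCurrentNode`, are placeholders rather than the actual Kylin API); the commented-out line shows the unconditional write described under Root Cause:
{code:java}
import java.util.Map;

// Sketch only; class and method names are placeholders, not the actual Kylin code.
public class JobOutputUpdater {

    public void updateJobOutput(String jobId, String newState, Map<String, String> info) {
        // Previously the handling node always wrote its own address:
        //   info.put("node_info", localNodeAddress());
        // With the fix, the address is written only when this node is the one
        // actually executing the task, so an epoch-routed update from Spark no
        // longer overwrites the executing slave's IP with the master's IP.
        if (isExecutingOnCurrentNode(jobId)) {
            info.put("node_info", localNodeAddress());
        }
        persist(jobId, newState, info);
    }

    private boolean isExecutingOnCurrentNode(String jobId) {
        // e.g. compare the executor recorded for the job with this node's address (omitted)
        return true;
    }

    private String localNodeAddress() {
        return "127.0.0.1:7070"; // placeholder for this node's host:port
    }

    private void persist(String jobId, String newState, Map<String, String> info) {
        // persist the task output and metadata (omitted)
    }
}
{code}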
When attempting to resume a running task, first acquire the `jobLock` to ensure that no other node is still executing the task.
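A rough sketch of the resume path under this rule, with placeholder names (`JobLockStore`, `tryAcquire`) that are not the actual Kylin classes:
{code:java}
// Sketch only; JobLockStore and the method names are placeholders.
public class JobResumer {

    interface JobLockStore {
        // atomically take the job lock for this node; false if another node holds it
        boolean tryAcquire(String jobId, String nodeAddress);
    }

    private final JobLockStore jobLock;

    public JobResumer(JobLockStore jobLock) {
        this.jobLock = jobLock;
    }

    // Resume a running task only after taking the job lock, so we know
    // no other node is still executing it.
    public boolean tryResumeRunningTask(String jobId, String localNodeAddress) {
        if (!jobLock.tryAcquire(jobId, localNodeAddress)) {
            return false; // another node still owns the task; do not resume here
        }
        schedule(jobId);
        return true;
    }

    private void schedule(String jobId) {
        // hand the task to this node's scheduler (omitted)
    }
}
{code}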
Add a new parameter, `kylin.job.max-transaction-retry`, with a default value of
3.
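For illustration, the parameter could bound a transaction retry loop roughly like this (the `Properties`-based config access and the exception handling are stand-ins, not the real Kylin implementation):
{code:java}
import java.util.Properties;

// Sketch only: bounding transaction retries with kylin.job.max-transaction-retry
// (default 3). The Properties-based config access is a stand-in for KylinConfig.
public class TransactionRetry {

    public static void runWithRetry(Properties props, Runnable transaction) {
        int maxRetry = Integer.parseInt(
                props.getProperty("kylin.job.max-transaction-retry", "3"));
        for (int attempt = 1; attempt <= maxRetry; attempt++) {
            try {
                transaction.run(); // execute the transactional work
                return;
            } catch (RuntimeException e) {
                if (attempt == maxRetry) {
                    throw e; // retries exhausted, surface the failure
                }
                // otherwise fall through and retry the transaction
            }
        }
    }
}
{code}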
The dataloading service uses scheduled (timed) tasks to check the tasks running on the current node, as well as errored and suspended tasks.
> Fix job scheduler related problems
> ----------------------------------
>
> Key: KYLIN-5857
> URL: https://issues.apache.org/jira/browse/KYLIN-5857
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: 5.0.0
> Reporter: pengfei.zhan
> Assignee: pengfei.zhan
> Priority: Major
> Fix For: 5.0.0
>
>