[
https://issues.apache.org/jira/browse/KYLIN-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865843#comment-17865843
]
pengfei.zhan edited comment on KYLIN-5857 at 7/15/24 3:12 AM:
--------------------------------------------------------------
h1. Root Cause
Job nodes have two roles: master and slave. The master node holds the project epoch. When the master triggers a task, it writes the task to the job lock table and records its own IP on the task. If the master node itself executes the job, there is no problem. However, if a slave node executes the task, Spark calls the KE API during execution to update the stage state. That request is routed to the master node based on the epoch, so when the master updates the task output it also writes its own IP into the task metadata, even though the task is actually running on the slave node.
h1. Dev Design
When a node updates the task output, it should write its own IP into the task metadata only if the task is being executed on that node. The KE API that Spark calls during task execution should no longer need to be routed based on the epoch.
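A minimal sketch of this rule (the class and helper names below, e.g. `isExecutingOnCurrentNode`, are placeholders rather than the actual Kylin API); the commented-out line shows the unconditional write described under Root Cause:
{code:java}
import java.util.Map;

// Sketch only; class and method names are placeholders, not the actual Kylin code.
public class JobOutputUpdater {

    public void updateJobOutput(String jobId, String newState, Map<String, String> info) {
        // Previously the handling node always wrote its own address:
        //   info.put("node_info", localNodeAddress());
        // With the fix, the address is written only when this node is the one
        // actually executing the task, so an epoch-routed update from Spark no
        // longer overwrites the executing slave's IP with the master's IP.
        if (isExecutingOnCurrentNode(jobId)) {
            info.put("node_info", localNodeAddress());
        }
        persist(jobId, newState, info);
    }

    private boolean isExecutingOnCurrentNode(String jobId) {
        // e.g. compare the executor recorded for the job with this node's address (omitted)
        return true;
    }

    private String localNodeAddress() {
        return "127.0.0.1:7070"; // placeholder for this node's host:port
    }

    private void persist(String jobId, String newState, Map<String, String> info) {
        // persist the task output and metadata (omitted)
    }
}
{code}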
When attempting to resume a running task, first acquire the `jobLock` to ensure that no other node is still executing the task.
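A rough sketch of the resume path under this rule, with placeholder names (`JobLockStore`, `tryAcquire`) that are not the actual Kylin classes:
{code:java}
// Sketch only; JobLockStore and the method names are placeholders.
public class JobResumer {

    interface JobLockStore {
        // atomically take the job lock for this node; false if another node holds it
        boolean tryAcquire(String jobId, String nodeAddress);
    }

    private final JobLockStore jobLock;

    public JobResumer(JobLockStore jobLock) {
        this.jobLock = jobLock;
    }

    // Resume a running task only after taking the job lock, so we know
    // no other node is still executing it.
    public boolean tryResumeRunningTask(String jobId, String localNodeAddress) {
        if (!jobLock.tryAcquire(jobId, localNodeAddress)) {
            return false; // another node still owns the task; do not resume here
        }
        schedule(jobId);
        return true;
    }

    private void schedule(String jobId) {
        // hand the task to this node's scheduler (omitted)
    }
}
{code}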
Add a new parameter, `kylin.job.max-transaction-retry`, with a default value of
3.
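For illustration, the parameter could bound a transaction retry loop roughly like this (the `Properties`-based config access and the exception handling are stand-ins, not the real Kylin implementation):
{code:java}
import java.util.Properties;

// Sketch only: bounding transaction retries with kylin.job.max-transaction-retry
// (default 3). The Properties-based config access is a stand-in for KylinConfig.
public class TransactionRetry {

    public static void runWithRetry(Properties props, Runnable transaction) {
        int maxRetry = Integer.parseInt(
                props.getProperty("kylin.job.max-transaction-retry", "3"));
        for (int attempt = 1; attempt <= maxRetry; attempt++) {
            try {
                transaction.run(); // execute the transactional work
                return;
            } catch (RuntimeException e) {
                if (attempt == maxRetry) {
                    throw e; // retries exhausted, surface the failure
                }
                // otherwise fall through and retry the transaction
            }
        }
    }
}
{code}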
The dataloading service uses scheduled (timed) tasks to check the tasks running on the current node, as well as errored and suspended tasks.
> Fix job scheduler related problems
> ----------------------------------
>
> Key: KYLIN-5857
> URL: https://issues.apache.org/jira/browse/KYLIN-5857
> Project: Kylin
> Issue Type: Bug
> Components: Job Engine
> Affects Versions: 5.0.0
> Reporter: pengfei.zhan
> Assignee: pengfei.zhan
> Priority: Major
> Fix For: 5.0.0
>
>