[ 
https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin P. McCabe updated KAFKA-7790:
-----------------------------------
    Description: If an Agent process is restarted, it will be re-sent the 
worker specifications for any tasks that are not DONE.  The agent will run 
these tasks for the original time period.  It should be fixed to run them only 
for the remaining time.  (was: All Trogdor task specifications have a defined 
`startMs` and `durationMs`. When tasks fail and agents restart, it is 
natural to expect that a task will not be re-run once its time window, 
`startMs` plus `durationMs`, has passed.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM; durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it 
re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more 
tasks scheduled on agent-0 (e.g. a task scheduled to start on each hour), we 
can end up overwhelming the agent with tasks that we would rather have 
dropped.
h2. Changes

We propose that the Trogdor Coordinator not re-schedule a task if the 
current time at re-scheduling is later than the task's start time plus its 
duration. More specifically:
{code:java}
// The task's window has already passed, so fail it instead of re-scheduling.
if (currentTimeMs > startTimeMs + durationTimeMs)
  failTask()
else
  scheduleTask(){code}
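
The check described in the proposal (a task is not re-scheduled once the current time is past its start time plus its duration) could be sketched as a small standalone class. This is only an illustration under stated assumptions: `ExpiryCheckSketch`, `Decision`, and `decide` are hypothetical names, not part of Trogdor, and times are plain epoch-style millisecond values.

```java
// Hypothetical sketch; not Trogdor's actual coordinator code.
final class ExpiryCheckSketch {
    enum Decision { SCHEDULE, FAIL }

    // Returns FAIL once currentTimeMs is past startMs + durationMs,
    // and SCHEDULE otherwise.
    static Decision decide(long currentTimeMs, long startMs, long durationMs) {
        long endMs = startMs + durationMs;
        return currentTimeMs > endMs ? Decision.FAIL : Decision.SCHEDULE;
    }

    public static void main(String[] args) {
        long startMs = 0L;              // stands in for 12:00
        long durationMs = 60 * 60_000L; // one hour
        // Restart at 12:47, still inside the window.
        System.out.println(decide(47 * 60_000L, startMs, durationMs));
        // Restart at 13:22, after the window has closed.
        System.out.println(decide(82 * 60_000L, startMs, durationMs));
    }
}
```

With the timeline from the example, the 12:47 restart yields `SCHEDULE` and the 13:22 restart yields `FAIL`, which breaks the rescheduling loop.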
 

 

 )

> Fix Bugs in Trogdor Task Expiration
> -----------------------------------
>
>                 Key: KAFKA-7790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major
>
> If an Agent process is restarted, it will be re-sent the worker 
> specifications for any tasks that are not DONE.  The agent will run these 
> tasks for the original time period.  It should be fixed to run them only for 
> the remaining time.
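
One possible way to compute the "remaining time" mentioned in the description is to clamp the end of the task's original window against the current time. The sketch below assumes a spec that carries a start time and a duration in milliseconds; `RemainingTimeSketch` and `remainingMs` are hypothetical names used only for illustration, not the actual Trogdor agent API.

```java
// Hypothetical sketch; not the actual Trogdor agent implementation.
final class RemainingTimeSketch {
    // Time left in the task's original window. Returns the full duration if the
    // task has not started yet, and zero once the window has passed.
    static long remainingMs(long nowMs, long startMs, long durationMs) {
        long endMs = startMs + durationMs;
        long effectiveStartMs = Math.max(nowMs, startMs);
        return Math.max(0L, endMs - effectiveStartMs);
    }
}
```

For a one-hour task starting at 12:00, an agent restarted at 12:47 would get 13 minutes of remaining run time under this sketch, and an agent restarted at 13:22 would get zero.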



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
