[ https://issues.apache.org/jira/browse/KAFKA-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin P. McCabe updated KAFKA-7790:
-----------------------------------
    Description:
If an Agent process is restarted, it will be re-sent the worker specifications for any tasks that are not DONE. The agent will run these tasks for the original time period. It should be fixed to run them only for the remaining time.

  (was: All Trogdor task specifications have a defined `startMs` and `durationMs`. Under conditions of task failure and restarts, it is intuitive to assume that a task would not be re-run after its time period has elapsed.

The issue is best illustrated with an example:
{code:java}
startMs = 12PM;
durationMs = 1hour;
# 12:02 - Coordinator schedules a task to run on agent-0
# 12:45 - agent-0 process dies. Coordinator's heartbeats to agent-0 fail.
# 12:47 - agent-0 comes back up. Coordinator's heartbeats pass and it re-schedules tasks that are not running in agent-0
# 13:20 - agent-0 process dies.
# 13:22 - agent-0 comes back up. Coordinator re-schedules task{code}
This can result in an endless loop of task rescheduling. If there are more tasks scheduled on agent-0 (e.g. a task scheduled to start on each hour), we can end up in a scenario where we overwhelm the agent with tasks that we would rather have dropped.
h2. Changes

We propose that the Trogdor Coordinator does not re-schedule a task if the current time of re-scheduling is greater than the start time of the task and its duration combined.
More specifically:
{code:java}
if (currentTimeMs > startTimeMs + durationTimeMs)
  failTask()
else
  scheduleTask(){code}
)

> Fix Bugs in Trogdor Task Expiration
> -----------------------------------
>
>                 Key: KAFKA-7790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7790
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major
>
> If an Agent process is restarted, it will be re-sent the worker
> specifications for any tasks that are not DONE. The agent will run these
> tasks for the original time period.
> It should be fixed to run them only for the remaining time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
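
The fix requested in the description, running a re-sent task only for its remaining time, could be sketched roughly as below. This is an illustration and not the actual Trogdor patch: the class `TaskExpiration` and the methods `remainingMs` and `isExpired` are hypothetical names, and times are given as millisecond offsets from 12:00 to mirror the example timeline.

```java
// Hypothetical helper for illustration only; not part of the Trogdor code base.
final class TaskExpiration {
    private TaskExpiration() {}

    /**
     * Milliseconds of the window [startMs, startMs + durationMs) that are still
     * left at nowMs. Never negative, and never more than durationMs.
     * Assumes startMs + durationMs does not overflow a long.
     */
    static long remainingMs(long startMs, long durationMs, long nowMs) {
        long endMs = startMs + durationMs;
        // If the agent restarts before startMs, the task keeps its full duration.
        long left = Math.min(durationMs, endMs - nowMs);
        return Math.max(0L, left);
    }

    /** True once the task's window has fully elapsed, so it should not be re-run. */
    static boolean isExpired(long startMs, long durationMs, long nowMs) {
        return remainingMs(startMs, durationMs, nowMs) == 0L;
    }

    public static void main(String[] args) {
        long minuteMs = 60_000L;
        long startMs = 0L;                 // 12:00, used as the time origin
        long durationMs = 60 * minuteMs;   // 1 hour

        // Agent restart at 12:47: 13 minutes of the window are left.
        System.out.println(remainingMs(startMs, durationMs, 47 * minuteMs)); // 780000

        // Agent restart at 13:22: the window is over, so the task is expired.
        System.out.println(isExpired(startMs, durationMs, 82 * minuteMs));   // true
    }
}
```

Under these assumptions, an agent that is re-sent a worker specification would run it for `remainingMs(...)` instead of the original `durationMs`, and would skip it entirely when `isExpired(...)` returns true.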