Task leasing causes scheduling inconsistencies
----------------------------------------------
Key: ROL-1446
URL: http://opensource.atlassian.com/projects/roller/browse/ROL-1446
Project: Roller
Issue Type: Bug
Affects Versions: 3.1
Reporter: Allen Gilliland
Assignee: Roller Unassigned
After a bit more poking around I have realized that some of the problems I've
seen with the task scheduling are actually being caused by the leasing process
we are using. The root of the problem is that the task scheduling is not
properly synchronized with the leasing process, and therefore scheduling drift
happens.
An example. Assume that a task is scheduled to run once per minute starting at
00:00:00.50. The subsequent run times for the task will then be 00:01:00.50,
00:02:00.50, etc. Now take into account the fact that in the database the lease
time of a task is the db time at which the task obtained its lease, and that
time is some amount of time after the task was actually started. So let's
assume for a moment that it takes 700ms to obtain a lease via the db. This
means that the time the db thinks a task was run differs from the time the app
thinks the task was run: in our example the db records 00:00:01.20
(00:00:00.50 + 700ms), which is 700ms later and already in the next clock
second. When the application runs the task the next time at 00:01:00.50 and
tries to obtain a new lease, it will be refused because the db thinks the last
run time for the task was 00:00:01.20, which is only 59.3 seconds before
00:01:00.50 and therefore less than the 60 second interval. So the additional
time required to obtain a lease in the db can cause the recorded lease time to
be off by 1 or more seconds and therefore cause a subsequent run of the task
to fail.
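To make the arithmetic above concrete, here is a minimal sketch in Java. It is
not Roller's actual leasing code: the class name, the leaseGranted helper and
the rule that a lease is granted only once a full interval has elapsed since
the recorded lease time are assumptions made for illustration. It also follows
the example's simplification of comparing the recorded lease time directly
against the time the app starts the next run.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical model of the lease check described in the example above.
class LeaseDriftDemo {

    // Assumed rule: grant a new lease only if at least one full interval
    // has passed since the lease time recorded in the db.
    static boolean leaseGranted(LocalTime recordedLease, LocalTime request, Duration interval) {
        return Duration.between(recordedLease, request).compareTo(interval) >= 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(1);
        Duration leaseLatency = Duration.ofMillis(700); // time spent obtaining the lease

        LocalTime firstStart = LocalTime.of(0, 0, 0, 500_000_000); // 00:00:00.50
        LocalTime recordedLease = firstStart.plus(leaseLatency);   // 00:00:01.20
        LocalTime secondStart = firstStart.plus(interval);         // 00:01:00.50

        System.out.println("recorded lease time: " + recordedLease);
        System.out.println("elapsed at second run: "
                + Duration.between(recordedLease, secondStart));
        System.out.println("lease granted: "
                + leaseGranted(recordedLease, secondStart, interval));
    }
}
```

With these numbers only 59.3 seconds have elapsed when the second run asks for
its lease, so the check refuses it even though the app started the run exactly
on schedule.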
I have seen this exact problem occur with jobs meant to run once daily, where
the job runs just after midnight, obtains a lease at 00:00:01.xxx and runs,
and then the following day the task fails to run because the app thinks that
the interval time for the task has not yet elapsed.
Sorting this out will require better alignment of the clocks and timestamps
stored in this process. This is the best option I can come up with right now:
When a task successfully obtains a lease and runs, it must keep track of the
exact time the task was first initiated. Then, when the task completes and
releases its lease, it stores that time in the db as the last time the lease
was acquired. This would be a fairly simple way of adjusting the lease time
stored in the db so that it does not include the additional time spent
obtaining the lease. For example, if a task is set to run hourly starting at
05:00:00 and it obtains its lease at 05:00:01.20, then when the task completes
we would subtract the 1.20 seconds from the time stored in the db so that the
db properly reflects the time the task was run, not the time the lease was
obtained.
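A rough sketch of that adjustment in Java might look like the following. The
class and method names are hypothetical and this is not existing Roller code.
It assumes the app measures the lease overhead on its own clock and subtracts
that amount from the lease time the db recorded.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical sketch of the proposed lease time correction.
class LeaseTimeCorrection {

    // Returns the time to store in the db on release: the recorded lease
    // time minus the overhead the app measured between starting the task
    // and obtaining the lease.
    static LocalTime correctedLeaseTime(LocalTime dbLeaseTime,
                                        LocalTime appStart,
                                        LocalTime appLeaseObtained) {
        Duration overhead = Duration.between(appStart, appLeaseObtained);
        return dbLeaseTime.minus(overhead);
    }

    public static void main(String[] args) {
        LocalTime appStart = LocalTime.of(5, 0, 0);                 // task initiated
        LocalTime leaseObtained = LocalTime.of(5, 0, 1, 200_000_000); // 1.20s later

        // For simplicity the db and app clocks are assumed to agree here.
        LocalTime stored = correctedLeaseTime(leaseObtained, appStart, leaseObtained);
        System.out.println("time stored on release: " + stored);

        // With the corrected time, the next hourly run sees a full interval.
        Duration elapsed = Duration.between(stored, LocalTime.of(6, 0, 0));
        System.out.println("full interval elapsed at next run: "
                + (elapsed.compareTo(Duration.ofHours(1)) >= 0));
    }
}
```

Measuring the overhead as a difference on the app clock means the correction
does not depend on the app and db clocks agreeing in absolute terms, only on
both ticking at the same rate over the second or so it takes to get the lease.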
I am sure there are other ways to better synchronize the multiple clocks
involved when doing clustered task scheduling, but at the end of the day it's
apparent that part of the solution is going to have to involve properly
accounting for the extra time that gets used up to obtain a lease so that
scheduling doesn't drift.