Task leasing causes scheduling inconsistencies
----------------------------------------------

                 Key: ROL-1446
                 URL: http://opensource.atlassian.com/projects/roller/browse/ROL-1446
             Project: Roller
          Issue Type: Bug
    Affects Versions: 3.1
            Reporter: Allen Gilliland
            Assignee: Roller Unassigned


After a bit more poking around I have realized that some of the problems I've 
seen with task scheduling are actually caused by the leasing process we are 
using.  The root of the problem is that task scheduling is not properly 
synchronized with the leasing process, and therefore scheduling drift happens.

An example.  Assume that a task is scheduled to run once per minute starting 
at 00:00:00.50.  The subsequent run times for the task will then be 
00:01:00.50, 00:02:00.50, etc.  Now take into account that in the database the 
lease time of a task is the db time at which the lease was obtained, and that 
this is some amount of time after the task was actually started.  So let's 
assume for a moment that it takes 700ms to obtain a lease via the db.  This 
means that the time the db thinks the task ran differs from the time the app 
thinks it ran; in our example the difference is 700ms, which is enough to push 
the db timestamp into the next second (00:00:00.50 + 700ms = 00:00:01.20).  
When the application runs the task the next time at 00:01:00.50 and tries to 
obtain a new lease, it will be refused, because the db thinks the last run 
time was 00:00:01.20, which is only 59.3 seconds before 00:01:00.50.  So the 
additional time required to obtain a lease in the db can put the recorded 
lease time off by a second or more and therefore cause a subsequent run of the 
task to fail.
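
The arithmetic in the example above can be checked with a small Java sketch.  
This is not Roller's actual leasing code: the class name LeaseDriftDemo and 
the method leaseGranted are hypothetical, and the check simply assumes the db 
grants a lease only once a full interval has passed since the lease time it 
recorded.  As in the example, the second check is evaluated at the app's 
scheduled run time.

```java
import java.time.Duration;
import java.time.LocalTime;

public class LeaseDriftDemo {

    /**
     * Hypothetical lease check: grant only if at least one full interval
     * has elapsed since the lease time recorded in the db.
     */
    static boolean leaseGranted(LocalTime lastLeased, LocalTime now, Duration interval) {
        return Duration.between(lastLeased, now).compareTo(interval) >= 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(1);
        Duration leaseLatency = Duration.ofMillis(700);  // assumed time to obtain a lease

        LocalTime firstRun = LocalTime.of(0, 0, 0, 500_000_000);  // 00:00:00.50 (app time)
        LocalTime dbLeaseTime = firstRun.plus(leaseLatency);      // 00:00:01.20 (db time)
        LocalTime secondRun = firstRun.plus(interval);            // 00:01:00.50 (app time)

        // Only 59.3 seconds separate the recorded lease time from the second run.
        System.out.println("elapsed per db: " + Duration.between(dbLeaseTime, secondRun));
        System.out.println("lease granted:  " + leaseGranted(dbLeaseTime, secondRun, interval));
    }
}
```

If the db had recorded the app's start time (00:00:00.50) instead, the same 
check would see exactly 60 seconds elapsed and grant the lease.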

I have seen this exact problem occur with jobs meant to run once daily, where 
the job runs just after midnight, obtains a lease at 00:00:01.xxx seconds 
and runs, and then the following day the task fails to run because the app 
thinks that the interval time for the task has not yet elapsed.

Sorting this out will require better alignment of the clocks and timestamps 
stored in this process, and this is the best option I can come up with right 
now:

When a task successfully obtains a lease and runs, it must keep track of the 
exact time the task was first initiated.  Then, when the task completes and 
releases its lease, it stores that time in the db as the last time the lease 
was acquired.  This would be a fairly simple attempt at adjusting the lease 
time stored in the db so that it does not include the additional time spent 
obtaining the lease.  For example, if a task is set to run hourly starting at 
05:00:00 and it obtains its lease at 05:00:01.20, then when the task completes 
we would subtract the 1.20 seconds from the time stored in the db, so that the 
db reflects the time the task was run, not the time the lease was obtained.
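
A rough sketch of that idea in Java follows.  Everything here is made up for 
illustration and is not Roller's API: LeaseRecord is an in-memory stand-in for 
the lease row in the db, runOnce simulates one scheduled run, and the date and 
latencies are arbitrary.  The lease latency is assumed to vary between runs 
(1.2 seconds on the first run, 300ms on the second), which is what lets the 
unadjusted timestamp cause a refusal.

```java
import java.time.Duration;
import java.time.Instant;

public class LeaseAdjustDemo {

    /** Hypothetical in-memory stand-in for the lease row stored in the db. */
    static class LeaseRecord {
        Instant lastLeased;  // null until the first lease is obtained
        boolean held;

        /** Grant the lease only if it is free and a full interval has elapsed. */
        boolean tryAcquire(Instant dbNow, Duration interval) {
            if (held) return false;
            if (lastLeased != null
                    && Duration.between(lastLeased, dbNow).compareTo(interval) < 0) {
                return false;
            }
            lastLeased = dbNow;  // db records the time the lease was obtained
            held = true;
            return true;
        }

        /** Release the lease, optionally overwriting the stored time with the task start. */
        void release(Instant taskStart, boolean adjust) {
            if (adjust) lastLeased = taskStart;
            held = false;
        }
    }

    /** Simulate one scheduled run; returns false if the lease was refused. */
    static boolean runOnce(LeaseRecord lease, Instant taskStart, Duration leaseLatency,
                           Duration interval, boolean adjust) {
        Instant dbNow = taskStart.plus(leaseLatency);  // lease reaches the db late
        if (!lease.tryAcquire(dbNow, interval)) return false;
        // ... the task's real work would happen here ...
        lease.release(taskStart, adjust);
        return true;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2007-01-01T05:00:00Z");  // arbitrary date
        Duration hour = Duration.ofHours(1);

        for (boolean adjust : new boolean[] {false, true}) {
            LeaseRecord lease = new LeaseRecord();
            boolean first = runOnce(lease, start, Duration.ofMillis(1200), hour, adjust);
            boolean second = runOnce(lease, start.plus(hour), Duration.ofMillis(300), hour, adjust);
            System.out.println("adjust=" + adjust + " first=" + first + " second=" + second);
        }
    }
}
```

Without the adjustment the second run is refused, because the db sees only 
59 minutes 59.1 seconds since 05:00:01.20; with it, the stored time is 
05:00:00 and the second run gets its lease.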

I am sure there are other ways to better synchronize the multiple clocks 
involved when doing clustered task scheduling, but at the end of the day it's 
apparent that part of the solution is going to have to involve properly 
accounting for the extra time that gets used up to obtain a lease so that 
scheduling doesn't drift.


