issues with scheduled tasks: ROL-1294, ROL-1446

Allen Gilliland Mon, 23 Jul 2007 15:52:55 -0700

There are currently a couple of issues open regarding the way background tasks are run, and I have some thoughts on how to fix them but need input on a couple of things. I'm going to break things down by issue ...

1. Task leasing causes "time drifting" problems in scheduled tasks. ROL-1446.

This is actually a bit tough to explain, but basically the problem is that it takes some amount of time between when a task starts on a thread and when it actually obtains a lease in the db, and by not accounting for that in our leasing checks we are causing problems with task running. For example, assume a task starts at midnight every night but obtains a lease 5 seconds after midnight on day 1. Then on day 2, if it tries to obtain that lease only 2 seconds after midnight, it will fail because the full 24 hours of task interval time has not elapsed yet. This is pretty easy to replicate by setting a task with interval = 1 and letting it run for a while with DEBUG logging enabled.

To fix this problem we basically need to calculate the next allowed run time for tasks using a time value other than the recorded lease acquisition times. We already have a "lastrun" column in the tasklock table which can be used for this, so the real key is to make sure that column accurately reflects the expected time a task is supposed to run, not the real time a task runs, which is always going to have some fractional amount of processing time added to it. For example, if a task is supposed to run daily at midnight then the lastrun time should always be 00:00:00.000 on a given date rather than 00:00:00.334 or 00:00:01.121, etc.
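As a rough sketch of the idea above (not Roller's actual code): one way to get a drift-free "lastrun" value is to snap the current time down to the most recent scheduled slot and store that slot rather than the lease acquisition time. The class and method names here (ExpectedRunTime, expectedRun, mayRun) and the scheduleStart parameter are hypothetical, introduced only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Roller: computes the slot a task was
// *supposed* to run in, ignoring however long it took to obtain the lease.
final class ExpectedRunTime {

    /** Returns the most recent scheduled slot at or before 'now'. */
    static Instant expectedRun(Instant scheduleStart, Duration interval, Instant now) {
        long intervalMs = interval.toMillis();
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long elapsedMs = Duration.between(scheduleStart, now).toMillis();
        // floorDiv keeps the result on the schedule grid even if now < scheduleStart
        long slots = Math.floorDiv(elapsedMs, intervalMs);
        return scheduleStart.plusMillis(slots * intervalMs);
    }

    /** A run is allowed when the current slot is later than the stored lastrun slot. */
    static boolean mayRun(Instant lastRun, Instant scheduleStart, Duration interval, Instant now) {
        Instant slot = expectedRun(scheduleStart, interval, now);
        return lastRun == null || slot.isAfter(lastRun);
    }
}
```

With a daily task scheduled from midnight, a lease taken at 00:00:05 on day 1 is recorded as 00:00:00, so an attempt at 00:00:02 on day 2 falls into a later slot and is allowed.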



2. Tasks should be configured in a central location, not per node.

Generally speaking this would only affect people running Roller in a cluster, but it makes no real sense to allow task scheduling to be configured independently per node in a cluster because the scheduling is required to be synchronized across all nodes. For example, if nodeA wants to run every 15 minutes and nodeB wants to run every 10 minutes, then if nodeA runs 10 minutes after the last run time for the task, should it execute?

In any case, that is a bit of a mess and the proper thing to do is to ensure that all task scheduling is configured in one place (the db) and all nodes can just read their config from there and schedule themselves accordingly. The other major benefit here is that it's easier for admins to see all the available tasks and decide which ones to run and at what intervals, since we would do it all in a GUI rather than a config file. And of course, with a GUI we can also report status, so that admins could see when tasks last ran, etc.

So, to fix this part of the problem we would basically need to take the current task configuration properties that are done via config file and put them in the db in the roller_tasklock table, which would include 4 new columns for ... "enabled", "starttime", "interval", and "leaseduration". We would then also need to set up a new page on the admin UI menu for viewing the tasks and altering their settings.
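To make the four proposed columns concrete, here is a minimal sketch of what the additions to the pojo might look like. The class name TaskSettings, the field types, and the choice of minutes as the unit are all assumptions for illustration; the proposal above only names the columns.

```java
// Hypothetical value object mirroring the four proposed roller_tasklock columns.
// Types and units are assumptions; the proposal only names the columns.
final class TaskSettings {
    final String name;              // task name (assumed identifier for the row)
    final boolean enabled;          // "enabled"
    final String startTime;         // "starttime", kept as an opaque string here
    final int intervalMinutes;      // "interval", unit assumed to be minutes
    final int leaseDurationMinutes; // "leaseduration", unit assumed to be minutes

    TaskSettings(String name, boolean enabled, String startTime,
                 int intervalMinutes, int leaseDurationMinutes) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name is required");
        }
        if (intervalMinutes <= 0 || leaseDurationMinutes <= 0) {
            throw new IllegalArgumentException("interval and lease duration must be positive");
        }
        this.name = name;
        this.enabled = enabled;
        this.startTime = startTime;
        this.intervalMinutes = intervalMinutes;
        this.leaseDurationMinutes = leaseDurationMinutes;
    }
}
```

Because every node would load the same row, there is only one interval per task and the nodeA/nodeB conflict described above cannot arise.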

I didn't really want to make any more changes like this for 4.0, but the current process is broken and so far this looks like the most appropriate thing to do. I don't think this is a whole lot of work since the task stuff is fairly independent and doesn't affect any other pieces of the code, so we are really just talking about adding a few things to the pojo and a new struts2 action & jsp.

I should also note that this will help out with #1 because it'll make it easier and more consistent to be able to track the expected run time of each task.



3. Synchronize task scheduling against a single clock (the db). ROL-1294.

This one I am less inclined to do, but technically it still makes a fair amount of sense. The reason this one is an issue is because the way that tasks are run is based on multiple independent schedulers, one per node in a cluster, and with various schedulers the task running can get out of sync. Schedulers can get out of sync in a couple of ways ...

First is just typical system clock drifting. If you have 3 machines, A and B which are application nodes and C which is the db, then all 3 need to be in perfect time sync for scheduling to stay consistent across the app. If any of the clocks on these systems becomes fast or slow it can cause the tasks to not run at their scheduled times or to miss acquiring leases because a task's interval time hasn't elapsed properly. This is relatively unlikely to occur, but it is a concern.

Second, and more likely, is that if you reconfigure the task schedules via a UI as described in #2, or even just by config file, then the schedulers for other nodes will be out of sync.

To accomplish this one I think we would have to do our own scheduling based off the database time, so we would start up a thread in the application which would check once per minute if it was time to run any tasks and if so it would start the tasks on new threads. Not entirely complicated, but it's still new code.
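A minimal sketch of that kind of polling scheduler, under stated assumptions: the database time is supplied through a Supplier<Instant> (in practice presumably a query against the db, which is not shown), and all names (DbClockScheduler, register, tick, startPolling) are hypothetical rather than existing Roller code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical scheduler: every decision uses the time returned by dbClock,
// never the local system clock of the node.
final class DbClockScheduler {

    private static final class Entry {
        final Runnable task;
        final Duration interval;
        Instant nextRun;

        Entry(Runnable task, Duration interval, Instant firstRun) {
            if (interval.isZero() || interval.isNegative()) {
                throw new IllegalArgumentException("interval must be positive");
            }
            this.task = task;
            this.interval = interval;
            this.nextRun = firstRun;
        }
    }

    private final Supplier<Instant> dbClock; // assumed to return the database's current time
    private final Executor workers;          // runs due tasks off the polling thread
    private final List<Entry> entries = new ArrayList<>();

    DbClockScheduler(Supplier<Instant> dbClock, Executor workers) {
        this.dbClock = dbClock;
        this.workers = workers;
    }

    synchronized void register(Runnable task, Duration interval, Instant firstRun) {
        entries.add(new Entry(task, interval, firstRun));
    }

    /** One polling pass: starts every due task and returns how many were started. */
    synchronized int tick() {
        Instant now = dbClock.get();
        int started = 0;
        for (Entry e : entries) {
            if (!now.isBefore(e.nextRun)) {
                workers.execute(e.task);
                // Advance in whole intervals from the scheduled time so slots never drift.
                do {
                    e.nextRun = e.nextRun.plus(e.interval);
                } while (!now.isBefore(e.nextRun));
                started++;
            }
        }
        return started;
    }

    /** Starts the once-per-minute check on a dedicated thread. */
    ScheduledExecutorService startPolling() {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
            try {
                tick();
            } catch (RuntimeException ex) {
                // An uncaught exception would cancel all future polls, so log and continue.
                ex.printStackTrace();
            }
        }, 0, 1, TimeUnit.MINUTES);
        return poller;
    }
}
```

A node would create one instance, for example with Executors.newCachedThreadPool() as the worker executor, register its tasks, and call startPolling(). Keeping tick() separate from the timer makes the due-task logic easy to exercise with a fake clock.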



4. Tasks mysteriously stop running.

This one I have no info about other than to say I have very regularly seen a problem where a task will stop running for no apparent reason, i.e. no errors in logs or anything, it just stops running. Through some debugging I have noted that when this happens the task is not even being started on a thread, so it's as if the task gets dropped from the scheduler altogether, because there is no debug output indicating the task even tried to start.

This would be the main thing that would drive me to want to do #3, assuming I can't figure out why the tasks mysteriously stop running.

So that's all the issues and a summary of proposed solutions. My feeling is that #1 is critical and has to be fixed for 4.0 somehow, #2 is a good thing that I'd like to see but I think we could technically do without it, #3 is nice but if I could avoid doing it I would, and #4 is critical and has to be fixed for 4.0.


Thoughts?  Comments?  Opinions?

I am trying to wrap up 4.0 work ASAP so please respond as soon as possible because I am already beginning to work on these things.


-- Allen
