Re: issues with scheduled tasks: ROL-1294, ROL-1446

Allen Gilliland Tue, 24 Jul 2007 10:46:44 -0700

A bit of an update based on some prototype work that I did yesterday and tested overnight ...

The use of the java.util.concurrent.ScheduledExecutorService to run the tasks is looking more and more iffy. I did some cleanup on the code yesterday and ran it overnight with debugging enabled, and for a while things looked like they were working great, but then the scheduler randomly moved the execution time of the tasks forward by 1.4 seconds ...

DEBUG 2007-07-24 05:54:00,000 RollerTaskWithLeasing:run - Attempting to acquire lease
DEBUG 2007-07-24 05:54:00,001 RollerTaskWithLeasing:run - Attempting to acquire lease
DEBUG 2007-07-24 05:54:58,600 RollerTaskWithLeasing:run - Attempting to acquire lease
DEBUG 2007-07-24 05:54:58,600 RollerTaskWithLeasing:run - Attempting to acquire lease

I then did some poking around on forums, etc. and found that the ScheduledExecutorService is not held in particularly high regard, even by the JSR team which added those utilities. This post in the forums by David Holmes (one of the JSR 166 expert group members) has me feeling uneasy ... "ScheduledThreadPoolExecutor has a number of nasty bugs - one of which is that cancelling tasks tends to terminate the pool thread. You should never run out of threads but you may run with fewer than expected threads and incur extra thread creation/teardown overhead."

In light of this info it's looking like doing #3 below, basically writing our own scheduler synchronized against the db time, may be the best option.


-- Allen


Allen Gilliland wrote:

There are currently a couple of issues open regarding the way background tasks are run and I have some thoughts on how to fix them but need input on a couple of things. I'm going to break things down by issue ...
1. Task leasing causes "time drifting" problems in scheduled tasks. ROL-1446.
This is actually a bit tough to explain, but basically the problem is that it takes some amount of time between when a task starts on a thread and when it actually obtains a lease in the db, and by not accounting for that in our leasing checks we are causing problems with task running. For example, assume a task starts at midnight every night but obtains a lease 5 seconds after midnight on day 1. Then on day 2, if it tries to obtain that lease only 2 seconds after midnight, it will fail because the full 24 hours of task interval time has not elapsed yet. This is pretty easy to replicate by setting a task with interval = 1 and letting it run for a while with DEBUG logging enabled.
To fix this problem we basically need to calculate the next allowed run time for tasks using a time value other than the recorded lease acquisition times. We already have a "lastrun" column in the tasklock table which can be used for this, so the real key is to make sure that column accurately reflects the expected time a task is supposed to run, not the real time a task runs, which is always going to have some fractional amount of processing time added to it. For example, if a task is supposed to run daily at midnight then the lastrun time should always be 00:00:00.000 on a given date rather than 00:00:00.334 or 00:00:01.121, etc.
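
To make the idea concrete, here is a minimal sketch of the calculation, not the actual Roller code. The class and method names (ExpectedRunTime, scheduledSlot, mayRun) are made up for illustration, and it assumes a JDK with java.time. It snaps the real start time back to the scheduled slot and compares slots instead of lease acquisition times.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of Roller.
final class ExpectedRunTime {

    /**
     * Returns the most recent scheduled slot at or before 'now', counting
     * whole intervals from the task's configured start time.
     */
    static Instant scheduledSlot(Instant start, Duration interval, Instant now) {
        long intervalMs = interval.toMillis();
        if (intervalMs <= 0 || now.isBefore(start)) {
            throw new IllegalArgumentException("need interval > 0 and now >= start");
        }
        long elapsedMs = Duration.between(start, now).toMillis();
        // Integer division drops the fractional processing delay.
        return start.plusMillis((elapsedMs / intervalMs) * intervalMs);
    }

    /** A task may run once a full interval separates lastRun from the candidate time. */
    static boolean mayRun(Instant lastRun, Duration interval, Instant candidate) {
        return lastRun == null || !candidate.isBefore(lastRun.plus(interval));
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2007-07-23T00:00:00Z");
        Duration daily = Duration.ofHours(24);
        Instant day1Actual = Instant.parse("2007-07-23T00:00:05Z"); // lease 5s late
        Instant day2Actual = Instant.parse("2007-07-24T00:00:02Z"); // lease 2s late

        // Comparing raw acquisition times: the day 2 run is refused.
        System.out.println(mayRun(day1Actual, daily, day2Actual)); // prints false

        // Comparing scheduled slots: the day 2 run is allowed.
        Instant slot1 = scheduledSlot(start, daily, day1Actual);
        Instant slot2 = scheduledSlot(start, daily, day2Actual);
        System.out.println(mayRun(slot1, daily, slot2)); // prints true
    }
}
```

With this approach the value written to lastrun would be the slot (00:00:00.000), so the processing delay never accumulates into the next check.
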
2. Tasks should be configured in a central location, not per node.
Generally speaking this would only affect people running Roller in a cluster, but it makes no real sense to allow task scheduling to be configured independently per node in a cluster because the scheduling is required to be synchronized across all nodes. For example, if nodeA wants to run every 15 minutes and nodeB wants to run every 10 minutes, then if nodeA runs 10 minutes after the last run time for the task, should it execute?
In any case, that is a bit of a mess and the proper thing to do is to ensure that all task scheduling is configured in one place (the db) and all nodes can just read their config from there and schedule themselves accordingly. The other major benefit here is that it's easier for admins to see all the available tasks and decide which ones to run and at what intervals since we would do it all in a GUI rather than a config file. And of course, with a GUI we can also report status, so that admins could see when tasks last ran, etc.
So, to fix this part of the problem we would basically need to take the current task configuration properties that are set via config file and put them in the db in the roller_tasklock table, which would include 4 new columns: "enabled", "starttime", "interval", and "leaseduration". We would then also need to set up a new page on the admin UI menu for viewing the tasks and altering their settings.
I didn't really want to make any more changes like this for 4.0, but the current process is broken and so far this looks like the most appropriate thing to do. I don't think this is a whole lot of work since the task stuff is fairly independent and doesn't affect any other pieces of the code, so we are really just talking about adding a few things to the pojo and a new struts2 action & jsp.
I should also note that this will help out with #1 because it'll make it easier and more consistent to be able to track the expected run time of each task.
3. Synchronize task scheduling against a single clock (the db). ROL-1294.
This one I am less inclined to do, but technically it still makes a fair amount of sense. This one is an issue because tasks are run by multiple independent schedulers, one per node in a cluster, and with several schedulers the task running can get out of sync. Schedulers can get out of sync in a couple of ways ...
First is just typical system clock drifting. If you have 3 machines, A and B which are application nodes and C which is the db, then all 3 need to be in perfect time sync for scheduling to stay consistent across the app. If any of the clocks on these systems becomes fast or slow it can cause the tasks to not run at their scheduled times or to miss acquiring leases because a task's interval time hasn't elapsed properly. This is relatively unlikely to occur, but it is a concern.
Second, and more likely, is that if you reconfigure the task schedules via a UI as described in #2 or even just by config file then the schedulers for other nodes will be out of sync.
To accomplish this one I think we would have to do our own scheduling based off the database time, so we would start up a thread in the application which would check once per minute if it was time to run any tasks and if so it would start the tasks on new threads. Not terribly complicated, but it's still new code.
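
A rough sketch of what such a polling thread could look like follows. This is an assumption-laden illustration, not a proposed implementation: DbClockScheduler, Task, collectDue and the dbClock supplier are all hypothetical names, and the supplier stands in for however the db time would actually be read (for example a query over JDBC). Lease acquisition is deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; names and structure are illustrative only.
final class DbClockScheduler implements Runnable {

    /** Minimal task record: what to run, how often, and the next scheduled slot. */
    static final class Task {
        final String name;
        final Duration interval;
        final Runnable work;
        Instant nextRun;

        Task(String name, Instant firstRun, Duration interval, Runnable work) {
            if (interval.isZero() || interval.isNegative()) {
                throw new IllegalArgumentException("interval must be positive");
            }
            this.name = name;
            this.nextRun = firstRun;
            this.interval = interval;
            this.work = work;
        }
    }

    private final Supplier<Instant> dbClock; // assumed to return the db server's time
    private final List<Task> tasks;

    DbClockScheduler(Supplier<Instant> dbClock, List<Task> tasks) {
        this.dbClock = dbClock;
        this.tasks = tasks;
    }

    /** Returns the tasks due at dbNow and advances each to its next slot after dbNow. */
    List<Task> collectDue(Instant dbNow) {
        List<Task> due = new ArrayList<>();
        for (Task t : tasks) {
            if (!dbNow.isBefore(t.nextRun)) {
                due.add(t);
                // Skip any slots that were missed so the task runs once, not several times.
                while (!dbNow.isBefore(t.nextRun)) {
                    t.nextRun = t.nextRun.plus(t.interval);
                }
            }
        }
        return due;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Only the db clock decides what is due; the local clock just paces the polling.
            for (Task t : collectDue(dbClock.get())) {
                new Thread(t.work, "task-" + t.name).start();
            }
            try {
                Thread.sleep(60_000L); // check once per minute
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit the loop on shutdown
            }
        }
    }
}
```

Because every node asks the same clock what is due, drift between the app servers only shifts when each node polls, not which slot it thinks it is running.
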
4. Tasks mysteriously stop running.
This one I have no info about other than to say I have very regularly seen a problem where a task will stop running for no apparent reason, i.e. no errors in logs or anything, it just stops running. Through some debugging I have noted that when this happens the task is not even being started on a thread, so it's as if the task gets dropped from the scheduler altogether, because there is no debug output indicating the task even tried to start.
This would be the main thing that would drive me to want to do #3, assuming I can't figure out why the tasks mysteriously stop running.
So that's all the issues and a summary of proposed solutions. My feeling is that #1 is critical and has to be fixed for 4.0 somehow, #2 is a good thing that I'd like to see but I think we could technically do without it, #3 is nice but if I could avoid doing it I would, and #4 is critical and has to be fixed for 4.0.
Thoughts?  Comments?  Opinions?
I am trying to wrap up 4.0 work ASAP so please respond as soon as possible because I am already beginning to work on these things.
-- Allen
