A bit of an update based on some prototype work that I did yesterday and tested overnight ...

The use of the java.util.concurrent.ScheduledExecutorService to run the tasks is looking more and more iffy. I did some cleanup on the code yesterday and ran it overnight with debugging enabled. For a while things looked like they were working great, but then the scheduler randomly moved the execution time of the tasks forward by 1.4 seconds ...

DEBUG 2007-07-24 05:54:00,000 RollerTaskWithLeasing:run - Attempting to acquire lease
DEBUG 2007-07-24 05:54:00,001 RollerTaskWithLeasing:run - Attempting to acquire lease


DEBUG 2007-07-24 05:54:58,600 RollerTaskWithLeasing:run - Attempting to acquire lease
DEBUG 2007-07-24 05:54:58,600 RollerTaskWithLeasing:run - Attempting to acquire lease

I then did some poking around on forums, etc. and found that the ScheduledExecutorService is not held in particularly high regard, even by the JSR team which added those utilities. This forum post by David Holmes (one of the JSR 166 expert group members) has me feeling uneasy ... "ScheduledThreadPoolExecutor has a number of nasty bugs - one of which is that cancelling tasks tends to terminate the pool thread. You should never run out of threads but you may run with fewer than expected threads and incur extra thread creation/teardown overhead."

In light of this info it's looking like doing #3 below, basically writing our own scheduler synchronized against the db time, may be the best option.

-- Allen


Allen Gilliland wrote:
There are currently a couple of issues open regarding the way background tasks are run, and I have some thoughts on how to fix them but need input on a couple of things. I'm going to break things down by issue ...

1. Task leasing causes "time drifting" problems in scheduled tasks. ROL-1446.

This is actually a bit tough to explain, but basically the problem is that some amount of time passes between when a task starts on a thread and when it actually obtains a lease in the db, and by not accounting for that in our leasing checks we are causing problems with task running. For example, assume a task starts at midnight every night but obtains a lease 5 seconds after midnight on day 1. Then on day 2, if it tries to obtain that lease only 2 seconds after midnight, it will fail because the full 24 hours of task interval time has not elapsed yet. This is pretty easy to replicate by setting a task with interval = 1 and letting it run for a while with DEBUG logging enabled.

To fix this problem we basically need to calculate the next allowed run time for tasks using a time value other than the recorded lease acquisition times. We already have a "lastrun" column in the tasklock table which can be used for this, so the real key is to make sure that column accurately reflects the expected time a task is supposed to run, not the real time a task runs, which is always going to have some fractional amount of processing time added to it. For example, if a task is supposed to run daily at midnight then the lastrun time should always be 00:00:00.000 on a given date rather than 00:00:00.334 or 00:00:01.121, etc.
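
To make the difference concrete, here is a rough sketch of the two checks side by side. This is not Roller code: the class LeaseTiming and its method names are made up for illustration, and it assumes the task has a known first start time and a fixed interval of at least one millisecond. The naive check compares against the real lease acquisition time; the slot-based check compares against the expected run time that "lastrun" would hold.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not taken from the Roller codebase.
final class LeaseTiming {

    private LeaseTiming() {}

    /** Naive check: measures the interval from the real lease acquisition time, so it drifts. */
    static boolean naiveMayRun(Instant lastAcquired, Duration interval, Instant now) {
        return !now.isBefore(lastAcquired.plus(interval));
    }

    /** Most recent scheduled slot at or before now, i.e. start + n * interval. */
    static Instant lastScheduledSlot(Instant start, Duration interval, Instant now) {
        long intervalMs = interval.toMillis();
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be at least 1 ms");
        }
        if (now.isBefore(start)) {
            throw new IllegalArgumentException("now is before the first scheduled start");
        }
        long slots = Duration.between(start, now).toMillis() / intervalMs;
        return start.plusMillis(slots * intervalMs);
    }

    /** Slot-based check: lastRun holds the expected run time, never the real one. */
    static boolean mayRun(Instant lastRun, Instant start, Duration interval, Instant now) {
        if (now.isBefore(start)) {
            return false;
        }
        return lastRun == null || lastScheduledSlot(start, interval, now).isAfter(lastRun);
    }
}
```

With a daily task starting at midnight, a lease obtained at 00:00:05 on day 1 would be recorded as 00:00:00.000, so an attempt at 00:00:02 on day 2 passes the slot-based check even though it fails the naive one.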


2. Tasks should be configured in a central location, not per node.

Generally speaking this would only affect people running Roller in a cluster, but it makes no real sense to allow task scheduling to be configured independently per node in a cluster because the scheduling is required to be synchronized across all nodes. For example, if nodeA wants to run every 15 minutes and nodeB wants to run every 10 minutes, then if nodeA runs 10 minutes after the last run time for the task, should it execute?

In any case, that is a bit of a mess and the proper thing to do is to ensure that all task scheduling is configured in one place (the db) and all nodes can just read their config from there and schedule themselves accordingly. The other major benefit here is that it's easier for admins to see all the available tasks and decide which ones to run and at what intervals since we would do it all in a GUI rather than a config file. And of course, with a GUI we can also report status, so that admins could see when tasks last ran, etc.

So, to fix this part of the problem we would basically need to take the current task configuration properties that are done via config file and put them in the db in the roller_tasklock table, which would gain 4 new columns: "enabled", "starttime", "interval", and "leaseduration". We would then also need to set up a new page on the admin UI menu for viewing the tasks and altering their settings.

I didn't really want to make any more changes like this for 4.0, but the current process is broken and so far this looks like the most appropriate thing to do. I don't think this is a whole lot of work since the task stuff is fairly independent and doesn't affect any other pieces of the code, so we are really just talking about adding a few things to the pojo and a new struts2 action & jsp.

I should also note that this will help out with #1 because it'll make it easier and more consistent to be able to track the expected run time of each task.


3. Synchronize task scheduling against a single clock (the db). ROL-1294.

This one I am less inclined to do, but technically it still makes a fair amount of sense. This one is an issue because task running is based on multiple independent schedulers, one per node in a cluster, and with multiple schedulers the task running can get out of sync. Schedulers can get out of sync in a couple of ways ...

First is just typical system clock drifting. If you have 3 machines, A and B which are application nodes and C which is the db, then all 3 need to be in perfect time sync for scheduling to stay consistent across the app. If any of the clocks on these systems becomes fast or slow it can cause the tasks to not run at their scheduled times or to miss acquiring leases because a task's interval time hasn't elapsed properly. This is relatively unlikely to occur, but it is a concern.

Second, and more likely, is that if you reconfigure the task schedules via a UI as described in #2 or even just by config file then the schedulers for other nodes will be out of sync.

To accomplish this one I think we would have to do our own scheduling based off the database time, so we would start up a thread in the application which would check once per minute if it was time to run any tasks and if so it would start the tasks on new threads. Not entirely complicated, but it's still new code.
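
A minimal sketch of what such a polling scheduler could look like follows. Everything in it is an assumption for illustration: DbClockScheduler, ScheduledTask and the dbClock supplier are invented names, and the supplier stands in for whatever query would actually fetch the database's current time. The next run time is advanced from the scheduled time rather than from "now", so processing delays do not accumulate.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from Roller.
final class DbClockScheduler {

    /** Minimal task record: a name, a body, a fixed interval, and the next scheduled run time. */
    static final class ScheduledTask {
        final String name;
        final Runnable body;
        final Duration interval;
        Instant nextRun;

        ScheduledTask(String name, Runnable body, Duration interval, Instant firstRun) {
            if (interval.isZero() || interval.isNegative()) {
                throw new IllegalArgumentException("interval must be positive");
            }
            this.name = name;
            this.body = body;
            this.interval = interval;
            this.nextRun = firstRun;
        }
    }

    // Assumed to wrap a query for the database's current time, so all nodes share one clock.
    private final Supplier<Instant> dbClock;
    private final List<ScheduledTask> tasks = new ArrayList<>();

    DbClockScheduler(Supplier<Instant> dbClock) {
        this.dbClock = dbClock;
    }

    void register(ScheduledTask task) {
        tasks.add(task);
    }

    /** One polling pass: starts every due task on a new thread and returns those threads. */
    List<Thread> pollOnce() {
        Instant now = dbClock.get();
        List<Thread> started = new ArrayList<>();
        for (ScheduledTask task : tasks) {
            if (!now.isBefore(task.nextRun)) {
                // Advance from the scheduled time, skipping any slots that were missed.
                do {
                    task.nextRun = task.nextRun.plus(task.interval);
                } while (!now.isBefore(task.nextRun));
                Thread thread = new Thread(task.body, "task-" + task.name);
                thread.start();
                started.add(thread);
            }
        }
        return started;
    }

    /** Polls once per minute until the calling thread is interrupted. */
    void runLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            pollOnce();
            Thread.sleep(60_000L);
        }
    }
}
```

In a cluster this would still need the lease check in the db before a task body actually does its work; the sketch only covers the "is it time yet" decision against a single clock.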


4. Tasks mysteriously stop running.

This one I have no info about other than to say I have very regularly seen a problem where a task will stop running for no apparent reason, i.e. no errors in logs or anything, it just stops running. Through some debugging I have noted that when this happens the task is not even being started on a thread, so it's as if the task gets dropped from the scheduler altogether, because there is no debug output indicating the task even tried to start.

This would be the main thing that would drive me to want to do #3 assuming I can't figure out why the tasks mysteriously stop running.


So that's all the issues and a summary of proposed solutions. My feeling is that #1 is critical and has to be fixed for 4.0 somehow, #2 is a good thing that I'd like to see but I think we could technically do without it, #3 is nice but if I could avoid doing it I would, and #4 is critical and has to be fixed for 4.0.

Thoughts?  Comments?  Opinions?

I am trying to wrap up 4.0 work ASAP so please respond as soon as possible because I am already beginning to work on these things.

-- Allen
