There are currently a couple of issues open regarding the way background
tasks are run and I have some thoughts on how to fix them, but I need
input on a couple of things. I'm going to break things down by issue ...
1. Task leasing causes "time drifting" problems in scheduled tasks.
ROL-1446.
This is a bit tough to explain, but basically the problem is that some
amount of time passes between when a task starts on a thread and when it
actually obtains a lease in the db. By not accounting for that delay in
our leasing checks we are causing tasks to skip runs. e.g. assume a
task starts at midnight every night but obtains its lease 5 seconds
after midnight on day 1. Then on day 2, if it tries to obtain that
lease only 2 seconds after midnight, it will fail because the full
24hrs of task interval time has not elapsed yet. This is pretty easy to
replicate by setting a task with interval = 1 and letting it run for a
while with DEBUG logging enabled.
To fix this problem we basically need to calculate the next allowed run
time for tasks using a time value other than the recorded lease
acquisition time. We already have a "lastrun" column in the tasklock
table which can be used for this, so the real key is to make sure that
column accurately reflects the time a task was supposed to run, not the
real time it ran, which is always going to have some fractional amount
of processing time added to it. e.g. if a task is supposed to run daily
at midnight then the lastrun time should always be 00:00:00.000 on a
given date rather than 00:00:00.334 or 00:00:01.121, etc.
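
The idea of recording the expected run time instead of the real one can
be sketched as below. This is only an illustration under assumptions:
the class ExpectedRunTime and its methods are hypothetical and not part
of Roller, and it uses java.time, which is newer than the code base
under discussion.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of Roller. */
final class ExpectedRunTime {

    /**
     * Returns the most recent scheduled slot at or before 'now', counting
     * whole intervals from the task's first scheduled start.
     */
    static Instant lastScheduledSlot(Instant firstStart, Duration interval, Instant now) {
        long intervalMs = interval.toMillis();
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (now.isBefore(firstStart)) {
            throw new IllegalArgumentException("now is before the first scheduled start");
        }
        long elapsedMs = Duration.between(firstStart, now).toMillis();
        long wholeIntervals = elapsedMs / intervalMs; // integer division drops the drift
        return firstStart.plusMillis(wholeIntervals * intervalMs);
    }

    /**
     * A run is allowed when the slot that 'now' falls into is later than the
     * slot stored as lastrun (or when the task has never run).
     */
    static boolean mayRun(Instant lastRun, Instant firstStart, Duration interval, Instant now) {
        if (now.isBefore(firstStart)) {
            return false;
        }
        return lastRun == null || lastScheduledSlot(firstStart, interval, now).isAfter(lastRun);
    }
}
```

With a daily task first scheduled for midnight, a lease taken at
00:00:05 on day 1 would store midnight of day 1 as lastrun, so an
attempt at 00:00:02 on day 2 falls into a later slot and is allowed.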
2. Tasks should be configured in a central location, not per node.
Generally speaking this would only affect people running Roller in a
cluster, but it makes no real sense to allow task scheduling to be
configured independently per node in a cluster because the scheduling
is required to be synchronized across all nodes. e.g. if nodeA wants
to run a task every 15 minutes and nodeB wants to run it every 10
minutes, then when nodeA tries to run 10 minutes after the task's last
run time, should it execute?
In any case, that is a bit of a mess and the proper thing to do is to
ensure that all task scheduling is configured in one place (the db) and
all nodes can just read their config from there and schedule themselves
accordingly. The other major benefit here is that it's easier for
admins to see all the available tasks and decide which ones to run and
at what intervals since we would do it all in a GUI rather than a config
file. And of course, with a GUI we can also report status, so that
admins could see when tasks last ran, etc.
So, to fix this part of the problem we would basically need to take the
current task configuration properties that are set via config file and
put them in the db in the roller_tasklock table, which would gain 4 new
columns: "enabled", "starttime", "interval", and "leaseduration". We
would then also need to set up a new page on the admin UI menu for
viewing the tasks and altering their settings.
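
As a rough picture of what each node would read from the db, here is a
minimal sketch of one task row as a pojo. The class name TaskSettings,
the field types, and the choice of minutes as the unit are all
assumptions made for illustration; only the four column names come from
the proposal above.

```java
import java.time.LocalTime;

/**
 * Hypothetical view of one roller_tasklock row carrying the four proposed
 * columns. Field types and the minute units are assumptions.
 */
final class TaskSettings {
    final String name;
    final boolean enabled;           // "enabled" column
    final LocalTime startTime;       // "starttime" column
    final int intervalMinutes;       // "interval" column
    final int leaseDurationMinutes;  // "leaseduration" column

    TaskSettings(String name, boolean enabled, LocalTime startTime,
                 int intervalMinutes, int leaseDurationMinutes) {
        if (name == null || startTime == null) {
            throw new IllegalArgumentException("name and startTime are required");
        }
        if (intervalMinutes <= 0 || leaseDurationMinutes <= 0) {
            throw new IllegalArgumentException("interval and lease duration must be positive");
        }
        this.name = name;
        this.enabled = enabled;
        this.startTime = startTime;
        this.intervalMinutes = intervalMinutes;
        this.leaseDurationMinutes = leaseDurationMinutes;
    }

    /** One-line summary of the kind an admin status page could list. */
    String describe() {
        return name + ": " + (enabled ? "enabled" : "disabled")
                + ", starts " + startTime
                + ", every " + intervalMinutes + " min"
                + ", lease " + leaseDurationMinutes + " min";
    }
}
```

Because every node would build this object from the same row, the
nodeA/nodeB disagreement described above cannot arise.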
I didn't really want to make any more changes like this for 4.0, but the
current process is broken and so far this looks like the most
appropriate thing to do. I don't think this is a whole lot of work
since the task stuff is fairly independent and doesn't affect any other
pieces of the code, so we are really just talking about adding a few
things to the pojo and a new struts2 action & jsp.
I should also note that this will help out with #1 because it'll make it
easier and more consistent to be able to track the expected run time of
each task.
3. Synchronize task scheduling against a single clock (the db). ROL-1294.
This one I am less inclined to do, but technically it still makes a fair
amount of sense. The issue is that tasks are run by multiple
independent schedulers, one per node in a cluster, and with separate
schedulers the task running can get out of sync. Schedulers can get
out of sync in a couple of ways ...
First is just typical system clock drifting. If you have 3 machines, A
and B which are application nodes and C which is the db, then all 3 need
to be in perfect time sync for scheduling to stay consistent across the
app. If the clock on any of these systems becomes fast or slow it can
cause tasks to not run at their scheduled times or to miss acquiring
leases because a task's interval time hasn't fully elapsed. This is
relatively unlikely to occur, but it is a concern.
Second, and more likely, is that if you reconfigure the task schedules
via a UI as described in #2 or even just by config file then the
schedulers for other nodes will be out of sync.
To accomplish this one I think we would have to do our own scheduling
based off the database time, so we would start up a thread in the
application which would check once per minute if it was time to run any
tasks and if so it would start the tasks on new threads. Not entirely
complicated, but it's still new code.
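
A minimal sketch of that polling loop is below, under stated
assumptions: DbClockPoller is a hypothetical class, the dbClock supplier
stands in for whatever query would fetch the current time from the
database, and task names are plain strings. It shows only the "is
anything due?" pass; starting the due tasks on new threads is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical poller that decides what is due using only the db clock. */
final class DbClockPoller {
    // Assumption: wraps a query that returns the database server's current time.
    private final Supplier<Instant> dbClock;
    private final Map<String, Instant> nextRun = new LinkedHashMap<>();
    private final Map<String, Duration> intervals = new LinkedHashMap<>();

    DbClockPoller(Supplier<Instant> dbClock) {
        this.dbClock = dbClock;
    }

    void register(String task, Instant firstRun, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        nextRun.put(task, firstRun);
        intervals.put(task, interval);
    }

    /** One polling pass: returns the due tasks and advances their next run time. */
    List<String> poll() {
        Instant now = dbClock.get();
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : nextRun.entrySet()) {
            if (now.isBefore(entry.getValue())) {
                continue; // not due yet
            }
            due.add(entry.getKey());
            // Advance from the scheduled time, not from 'now', so the
            // schedule does not drift; skip any slots that were missed.
            Instant next = entry.getValue();
            Duration interval = intervals.get(entry.getKey());
            while (!next.isAfter(now)) {
                next = next.plus(interval);
            }
            entry.setValue(next);
        }
        return due;
    }
}
```

One way to drive it would be a single-threaded
ScheduledExecutorService that calls poll() once per minute and hands
each returned task to a worker pool.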
4. Tasks mysteriously stop running.
This one I have no info about other than to say I have very regularly
seen a problem where a task will stop running for no apparent reason,
i.e. no errors in logs or anything, it just stops running. Through
some debugging I have noted that when this happens the task is not even
being started on a thread, so it's as if the task gets dropped from the
scheduler altogether, because there is no debug output indicating the
task even tried to start.
This would be the main thing that would drive me to want to do #3
assuming I can't figure out why the tasks mysteriously stop running.
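
One possibility worth ruling out, without any evidence yet that it is
the actual cause: both java.util.Timer and
ScheduledExecutorService.scheduleAtFixedRate stop running a task after
it throws an unchecked exception, and that exception does not
necessarily reach the application log. The sketch below wraps a task so
that every start and every failure is logged and the exception never
reaches the scheduler. GuardedTask is a hypothetical name, and the
sketch uses java.util.logging only to stay self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper that keeps a failing task from escaping silently. */
final class GuardedTask implements Runnable {
    private static final Logger LOG = Logger.getLogger(GuardedTask.class.getName());

    private final String name;
    private final Runnable delegate;
    private final AtomicInteger failures = new AtomicInteger();

    GuardedTask(String name, Runnable delegate) {
        this.name = name;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        // Logged before anything else so a missing line proves the task never started.
        LOG.fine("Task starting: " + name);
        try {
            delegate.run();
            LOG.fine("Task finished: " + name);
        } catch (RuntimeException e) {
            // Swallow after logging so the scheduler keeps the task scheduled.
            failures.incrementAndGet();
            LOG.log(Level.SEVERE, "Task failed: " + name, e);
        }
    }

    int failureCount() {
        return failures.get();
    }
}
```

If the start line stops appearing in the log while the failure count
stays at zero, the task really is being dropped before it starts and
the cause lies elsewhere.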
So that's all the issues and a summary of proposed solutions. My
feeling is that #1 is critical and has to be fixed for 4.0 somehow, #2
is a good thing that I'd like to see but I think we could technically do
without it, #3 is nice but if I could avoid doing it I would, and #4 is
critical and has to be fixed for 4.0.
Thoughts? Comments? Opinions?
I am trying to wrap up 4.0 work ASAP so please respond as soon as
possible because I am already beginning to work on these things.
-- Allen