On 4/01/2014 4:14 PM, Greg Trasuk wrote:
On Jan 4, 2014, at 12:52 AM, Peter Firmstone<j...@zeus.net.au> wrote:
On 4/01/2014 3:18 PM, Greg Trasuk wrote:
I’ll also point out Patricia’s recent statement that TaskManager should be
reasonably efficient for small task queues, but less efficient for larger task
queues. We don’t have solid evidence that the task queues ever get large.
Hence, the assertion that “TaskManager doesn’t scale” is meaningless.
No, it's not about scalability, it's about the window of time when a task is removed from
the queue in TaskManager for execution but fails and needs to be retried later.
Task.runAfter doesn't contain the task that "should have executed" so dependant
tasks proceed before their depenencies.
This code comment from ServiceDiscoveryManager might help:
/** This task class, when executed, first registers to receive
* ServiceEvents from the given ServiceRegistrar. If the registration
* process succeeds (no RemoteExceptions), it then executes the
* LookupTask to query the given ServiceRegistrar for a "snapshot"
* of its current state with respect to services that match the
* given template.
*
* Note that the order of execution of the two tasks is important.
* That is, the LookupTask must be executed only after registration
* for events has completed. This is because when an entity registers
* with the event mechanism of a ServiceRegistrar, the entity will
* only receive notification of events that occur "in the future",
* after the registration is made. The entity will not receive events
* about changes to the state of the ServiceRegistrar that may have
* occurred before or during the registration process.
*
* Thus, if the order of these tasks were reversed and the LookupTask
* were to be executed prior to the RegisterListenerTask, then the
* possibility exists for the occurrence of a change in the
* ServiceRegistrar's state between the time the LookupTask retrieves
* a snapshot of that state, and the time the event registration
* process has completed, resulting in an incorrect view of the
* current state of the ServiceRegistrar.
When do you claim that this happens? And what currently happens now that is
unacceptable?
It causes test failures caused by missing events for qa-refactor.
What is the concrete, observable problem that you’re trying to solve, that
justifies introducing failures that require further work?
The 2.2 codebase in it's current form is very brittle, small changes
create unrelated failures due to unfortunate timing. For that reason,
it's very difficult to develop with. For example, Sim made some
straightforward changes that caused the trunk build to fail, the
failures were not caused by Sim's code, but by changes in timing.
So when you try to fix that, it causes another failure, and so on, it's
a snowball effect.
Realising I didn't have a wide enough test base to confirm stability
(even though tests have been run on Solaris Sparc, Solaris x64, Win x64,
Win 32, Linux x64, Linux ARM and OSX x64) I turned to static analysis
(FindBugs) and manual code review to suppliment testing.
So when something looks susceptible to breaking, rather than wait for it
to fail, or if FindBugs locates a race condition, I fix it, this doesn't
change the functionality of the code when it's behaving as expected.
If real usage never requires a large task queue, then scalability isn’t an
issue, and we don’t know whether it ever needs a large task queue.
In any case, removing TaskManager and replacing it with hard-coded
ThreadPoolExecutors moves us farther away from having the capability of a
shared work queue.
So I’m not in favour of this change. I haven’t looked at the other services
or utility classes, but if the changes are similar, I’m also not in favour.
No TaskManager instances have been replaced by ExecutorService, which is set
via configuration. The hard coded part is how to order tasks through the
configuration provided ExecutorService.
In the 2.2 branch implementation of com.sun.jini.reggie.RegistrarImpl there’s a
variable called ‘tasker’ that is not present in the qa_refactor branch’s
implementation.
Yes, tasker has been replaced by two Executors, one for event
notifications, the other for discovery responses, at present these are
hardcoded.
307
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l307>
private JoinManager joiner; private volatile JoinManager
joiner; // accessed without lock from DestroyThread
308
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l308>
/** Task manager for sending events and discovery responses */
/** Executors for sending events and discovery responses */
309
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l309>
private TaskManager tasker; private final ThreadPoolExecutor
eventNotifierExec;
310
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l310>
private final ThreadPoolExecutor discoveryResponseExec;
311
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l311>
This is how Mahalo does it:
205
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l205>
settlerpool = (TaskManager) Config.getNonNullEntry(config,
TxnManager.MAHALO, "settlerPool", TaskManager.class, new
TaskManager(settlerthreads, settlertimeout, settlerload));
settlerpool = Config.getNonNullEntry(
206
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l206>
taskpool = (TaskManager) Config.getNonNullEntry(config,
TxnManager.MAHALO, "taskPool", TaskManager.class, new
TaskManager(taskthreads, tasktimeout, taskload)); config,
207
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l207>
TxnManager.MAHALO,
208
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l208>
"settlerPool",
209
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l209>
ExecutorService.class,
210
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l210>
new ThreadPoolExecutor(
211
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l211>
1,
212
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l212>
settlerthreads,
213
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l213>
settlertimeout,
214
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l214>
TimeUnit.MILLISECONDS,
215
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l215>
new LinkedBlockingQueue<Runnable>(),
216
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l216>
new NamedThreadFactory("TxnMgr settlerPool", false)
217
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l217>
)
218
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l218>
);
219
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l219>
taskpool = Config.getNonNullEntry(
220
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l220>
config,
221
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l221>
TxnManager.MAHALO,
222
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l222>
"taskPool",
223
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l223>
ExecutorService.class,
224
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l224>
new ThreadPoolExecutor(
225
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l225>
1,
226
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l226>
taskthreads,
227
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l227>
tasktimeout,
228
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l228>
TimeUnit.MILLISECONDS,
229
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l229>
new LinkedBlockingQueue<Runnable>(),
230
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l230>
new NamedThreadFactory("TxnMgr taskPool", false)
231
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l231>
)
232
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l232>
);
233
<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l233>
One option is to stop worrying about event order at the sender, and figure out
a way of ordering at the recepient.
I don’t understand what you’re saying here.
Task.runAfter dependencies are performed at the sender, even if ordered
correctly in TaskManager, there's no telling which thread from the pool
will be scheduled by the operating system, this can alter task
processing order, network latency can also dictate out of order arrival
at the recepient.
You’re introducing changes that introduce test failures (which is why you’re
asking for help) without a good reason.
The reason is to expose synchronization bugs so they can be observed and fixed.
Randomly changing code is not a rational diagnostic technique.
Which changes do you feel are random? Feel free to ask about why I've
made a particular change, but try to keep questions short, so they're
managable for me.
You’re never going to ship this code unless you stop modifying it.
It is a considerable undertaking, but I'm not in any hurry, it's not yet ready
for release. I understand you're concerned about the considerable number of
changes; there will be plenty of time for compatibility testing.
Also, when you say below,
I'm developing an ExecutorService wrapper that retry's failed tasks in
org.apache.river.impl.thread.SerialExecutorService, by not removing a task from
it's queue until it completes successfully, it prevents any dependant tasks
from running, I would like to use this as a replacement for TaskManager and
RetryTask.
…be careful! You’re getting into the same difficult area as transactional
semantics around messaging. Will you need to provide a “dead task” queue? Do
you need to set a limit on how many times a task get retried? What happens
when that limit is exceeded? Do all tasks have the same limit? Should a task
get notified when it’s exceeded the retry limit? How long should you wait
between retries? Is that number the same for all tasks. Is there some kind of
alarm or notification when tasks end up being retried, or when the dead task
queue becomes full?
Since most dependencies appear to be based on who the recipient is, if that
recepient is not contactable in spite of considerable effort, we should abandon
futher attempts to do so. At present RetryTask will continue to attempt to
make contact every 5 minutes.
I don’t understand what you’re saying here.
RetryTask is required because of unreliable network connections, if we
send a sequence of events to a remote host, we are currently trying to
send them in order. If one task fails briefly, then the events will
likely arrive out of order at the recipient when the failed task is
later retried.
Regards,
Peter.