Re: Build failed in Jenkins: river-qa-refactor-win #45

Peter Firmstone Fri, 03 Jan 2014 23:04:09 -0800

On 4/01/2014 4:14 PM, Greg Trasuk wrote:

On Jan 4, 2014, at 12:52 AM, Peter Firmstone<[email protected]>  wrote:

On 4/01/2014 3:18 PM, Greg Trasuk wrote:

I’ll also point out Patricia’s recent statement that TaskManager should be 
reasonably efficient for small task queues, but less efficient for larger task 
queues.  We don’t have solid evidence that the task queues ever get large.  
Hence, the assertion that “TaskManager doesn’t scale” is meaningless.

No, it's not about scalability, it's about the window of time when a task is removed from 
the queue in TaskManager for execution but fails and needs to be retried later.  
Task.runAfter doesn't contain the task that "should have executed" so dependant 
tasks proceed before their depenencies.

This code comment from ServiceDiscoveryManager might help:

       /** This task class, when executed, first registers to receive
         *  ServiceEvents from the given ServiceRegistrar. If the registration
         *  process succeeds (no RemoteExceptions), it then executes the
         *  LookupTask to query the given ServiceRegistrar for a "snapshot"
         *  of its current state with respect to services that match the
         *  given template.
         *
         *  Note that the order of execution of the two tasks is important.
         *  That is, the LookupTask must be executed only after registration
         *  for events has completed. This is because when an entity registers
         *  with the event mechanism of a ServiceRegistrar, the entity will
         *  only receive notification of events that occur "in the future",
         *  after the registration is made. The entity will not receive events
         *  about changes to the state of the ServiceRegistrar that may have
         *  occurred before or during the registration process.
         *
         *  Thus, if the order of these tasks were reversed and the LookupTask
         *  were to be executed prior to the RegisterListenerTask, then the
         *  possibility exists for the occurrence of a change in the
         *  ServiceRegistrar's state between the time the LookupTask retrieves
         *  a snapshot of that state, and the time the event registration
         *  process has completed, resulting in an incorrect view of the
         *  current state of the ServiceRegistrar.

When do you claim that this happens?  And what currently happens now that is 
unacceptable?


It causes test failures caused by missing events for qa-refactor.

   What is the concrete, observable problem that you’re trying to solve, that 
justifies introducing failures that require further work?

The 2.2 codebase in it's current form is very brittle, small changescreate unrelated failures due to unfortunate timing. For that reason,it's very difficult to develop with. For example, Sim made somestraightforward changes that caused the trunk build to fail, thefailures were not caused by Sim's code, but by changes in timing.

So when you try to fix that, it causes another failure, and so on, it'sa snowball effect.

Realising I didn't have a wide enough test base to confirm stability(even though tests have been run on Solaris Sparc, Solaris x64, Win x64,Win 32, Linux x64, Linux ARM and OSX x64) I turned to static analysis(FindBugs) and manual code review to suppliment testing.

So when something looks susceptible to breaking, rather than wait for itto fail, or if FindBugs locates a race condition, I fix it, this doesn'tchange the functionality of the code when it's behaving as expected.

  If real usage never requires a large task queue, then scalability isn’t an 
issue, and we don’t know whether it ever needs a large task queue.

In any case, removing TaskManager and replacing it with hard-coded 
ThreadPoolExecutors moves us farther away from having the capability of a 
shared work queue.
   So I’m not in favour of this change.  I haven’t looked at the other services 
or utility classes, but if the changes are similar, I’m also not in favour.

No TaskManager instances have been replaced by ExecutorService, which is set 
via configuration.  The hard coded part is how to order tasks through the 
configuration provided ExecutorService.

In the 2.2 branch implementation of com.sun.jini.reggie.RegistrarImpl there’s a 
variable called ‘tasker’ that is not present in the qa_refactor branch’s 
implementation.

Yes, tasker has been replaced by two Executors, one for eventnotifications, the other for discovery responses, at present these arehardcoded.

307<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l307>private JoinManager joiner; private volatile JoinManagerjoiner; // accessed without lock from DestroyThread308<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l308>/** Task manager for sending events and discovery responses *//** Executors for sending events and discovery responses */309<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l309>private TaskManager tasker; private final ThreadPoolExecutoreventNotifierExec;310<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l310>private final ThreadPoolExecutor discoveryResponseExec;311<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/reggie/RegistrarImpl.java?annotate=1552606&diff_format=h#l311>



This is how Mahalo does it:

205<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l205>settlerpool = (TaskManager) Config.getNonNullEntry(config,TxnManager.MAHALO, "settlerPool", TaskManager.class, newTaskManager(settlerthreads, settlertimeout, settlerload));settlerpool = Config.getNonNullEntry(206<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l206>taskpool = (TaskManager) Config.getNonNullEntry(config,TxnManager.MAHALO, "taskPool", TaskManager.class, newTaskManager(taskthreads, tasktimeout, taskload)); config,207<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l207>TxnManager.MAHALO,208<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l208>"settlerPool",209<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l209>ExecutorService.class,210<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l210>new ThreadPoolExecutor(211<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l211>1,212<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l212>settlerthreads,213<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l213>settlertimeout,214<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l214>TimeUnit.MILLISECONDS,215<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l215>new LinkedBlockingQueue<Runnable>(),216<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l216>new NamedThreadFactory("TxnMgr settlerPool", false)217<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l217>)218<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l218>);219<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l219>taskpool = Config.getNonNullEntry(220<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l220>config,221<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l221>TxnManager.MAHALO,222<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l222>"taskPool",223<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l223>ExecutorService.class,224<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l224>new ThreadPoolExecutor(225<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l225>1,226<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l226>taskthreads,227<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l227>tasktimeout,228<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l228>TimeUnit.MILLISECONDS,229<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l229>new LinkedBlockingQueue<Runnable>(),230<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l230>new NamedThreadFactory("TxnMgr taskPool", false)231<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l231>)232<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l232>);233<http://svn.apache.org/viewvc/river/jtsk/skunk/qa_refactor/trunk/src/com/sun/jini/mahalo/TxnManagerImplInitializer.java?annotate=1555061&diff_format=h#l233>

One option is to stop worrying about event order at the sender, and figure out 
a way of ordering at the recepient.

I don’t understand what you’re saying here.

Task.runAfter dependencies are performed at the sender, even if orderedcorrectly in TaskManager, there's no telling which thread from the poolwill be scheduled by the operating system, this can alter taskprocessing order, network latency can also dictate out of order arrivalat the recepient.

   You’re introducing changes that introduce test failures (which is why you’re 
asking for help) without a good reason.

The reason is to expose synchronization bugs so they can be observed and fixed.

Randomly changing code is not a rational diagnostic technique.

Which changes do you feel are random? Feel free to ask about why I'vemade a particular change, but try to keep questions short, so they'remanagable for me.

  You’re never going to ship this code unless you stop modifying it.

It is a considerable undertaking, but I'm not in any hurry, it's not yet ready 
for release.  I understand you're concerned about the considerable number of 
changes; there will be plenty of time for compatibility testing.

Also, when you say below,

I'm developing an ExecutorService wrapper that retry's failed tasks in 
org.apache.river.impl.thread.SerialExecutorService, by not removing a task from 
it's queue until it completes successfully, it prevents any dependant tasks 
from running, I would like to use this as a replacement for TaskManager and 
RetryTask.

…be careful!  You’re getting into the same difficult area as transactional 
semantics around messaging.  Will you need to provide a “dead task” queue?  Do 
you need to set a limit on how many times a task get retried?  What happens 
when that limit is exceeded?  Do all tasks have the same limit?  Should a task 
get notified when it’s exceeded the retry limit?  How long should you wait 
between retries?  Is that number the same for all tasks.  Is there some kind of 
alarm or notification when tasks end up being retried, or when the dead task 
queue becomes full?

Since most dependencies appear to be based on who the recipient is, if that 
recepient is not contactable in spite of considerable effort, we should abandon 
futher attempts to do so.   At present RetryTask will continue to attempt to 
make contact every 5 minutes.

I don’t understand what you’re saying here.

RetryTask is required because of unreliable network connections, if wesend a sequence of events to a remote host, we are currently trying tosend them in order. If one task fails briefly, then the events willlikely arrive out of order at the recipient when the failed task islater retried.

Regards,

Peter.

Re: Build failed in Jenkins: river-qa-refactor-win #45

Reply via email to