Kurt Westerfeld created KARAF-6224:
--------------------------------------

             Summary: Race condition in BaseActivator on first launch
                 Key: KARAF-6224
                 URL: https://issues.apache.org/jira/browse/KARAF-6224
             Project: Karaf
          Issue Type: Bug
          Components: karaf
    Affects Versions: 4.2.4, 4.1.7, 4.0.10
            Reporter: Kurt Westerfeld


We run several Karaf containers on a single machine with a large number of 
cores (20).  Because the core count is high, this may be a hard problem to 
reproduce elsewhere.  We have customized the RMI and JMX ports for each of the 
containers so that they do not conflict.  However, after the first Karaf VM is 
launched and claims ports 1099/44444, the second VM briefly attempts to claim 
the same ports before its customized configuration has been read from the 
${karaf.etc} directory.  You can see the management bundle start and then a 
configuration update arrive immediately with the corrected values.

Looking over BaseActivator, it seems that a thread is created to dispatch 
the initialization, and sometimes this thread finds the "config" field still 
null because the asynchronous managed service event has not yet arrived.  In 
this case the configuration is missing and defaults are used, so the bundle 
temporarily tries to use ports 1099 and 44444 until the first managed service 
event arrives through the updated() method.  Immediately after that, the 
service reconfigures and uses the proper customized values.
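
To make the suspected sequence concrete, here is a minimal sketch of the 
pattern as I understand it.  It is a simplified model, not the actual 
BaseActivator source: the class name RacyActivatorModel, the method 
resolveRegistryPort() and the use of the "rmiRegistryPort" key are assumptions 
made for illustration only.

```java
import java.util.Dictionary;

// Hypothetical, simplified model of the suspected race.
// This is NOT the real BaseActivator code.
class RacyActivatorModel {
    // Written by updated(); stays null until the first event arrives.
    private volatile Dictionary<String, Object> config;

    // Stand-in for the asynchronous managed service callback.
    void updated(Dictionary<String, Object> properties) {
        this.config = properties;
    }

    // Stand-in for the work done on the initialization thread.
    int resolveRegistryPort() {
        Dictionary<String, Object> current = config;
        if (current == null) {
            // No configuration yet: fall back to the default port.
            return 1099;
        }
        Object value = current.get("rmiRegistryPort"); // assumed key name
        return value == null ? 1099 : Integer.parseInt(value.toString());
    }
}
```

If resolveRegistryPort() runs before updated() has been called, the default 
is returned even though a customized value exists on disk, which matches the 
behavior described above.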

This is a problem for us because during this temporary window a client can 
mistakenly connect to the wrong container.  We use JMX over RMI to perform a 
number of management operations, and this initial startup is unreliable.  Our 
three Karaf containers have some interdependencies, and this temporary 
condition causes problems with them.

This problem occurs less often on subsequent restarts, which suggests that 
the initial provisioning of ${karaf.etc} is part of the race.  We have, 
however, seen it happen at any time, though more rarely.  We believe the high 
core count of the server this runs on is what exposes the race condition.

A suggested fix is to call config admin in run() to read the configuration 
if this.config is null.  This would handle the race here, but I am not sure 
whether it could cause other bad interactions with config admin.
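
A rough sketch of the idea, under stated assumptions: the Supplier below is a 
hypothetical stand-in for a synchronous config admin lookup, and 
FallbackActivatorModel, resolveRegistryPort() and the "rmiRegistryPort" key are 
illustrative names, not the real BaseActivator API.  It only shows the shape 
of the fallback, not a tested patch.

```java
import java.util.Dictionary;
import java.util.function.Supplier;

// Hypothetical model of the suggested fix; not the real BaseActivator code.
class FallbackActivatorModel {
    // Written by updated(); stays null until the first event arrives.
    private volatile Dictionary<String, Object> config;

    // Assumed stand-in for a synchronous read from config admin.
    private final Supplier<Dictionary<String, Object>> configLookup;

    FallbackActivatorModel(Supplier<Dictionary<String, Object>> configLookup) {
        this.configLookup = configLookup;
    }

    // Stand-in for the asynchronous managed service callback.
    void updated(Dictionary<String, Object> properties) {
        this.config = properties;
    }

    // Stand-in for the work done in run().
    int resolveRegistryPort() {
        Dictionary<String, Object> current = config;
        if (current == null) {
            // Event not delivered yet: read the stored configuration directly.
            // The lookup may itself return null if nothing is stored.
            current = configLookup.get();
        }
        if (current == null) {
            return 1099; // nothing stored at all, so the default is correct
        }
        Object value = current.get("rmiRegistryPort"); // assumed key name
        return value == null ? 1099 : Integer.parseInt(value.toString());
    }
}
```

With this shape the defaults are used only when no configuration is stored at 
all.  Whether a direct lookup at this point interacts badly with the later 
updated() delivery is the open question.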



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
