Amichai Rothman created DOSGI-173:
-------------------------------------

             Summary: unregistering an exported service does not remove it from zookeeper (and remote clients)
                 Key: DOSGI-173
                 URL: https://issues.apache.org/jira/browse/DOSGI-173
             Project: CXF Distributed OSGi
          Issue Type: Bug
    Affects Versions: 1.5
            Reporter: Amichai Rothman


I have some bundles exporting and consuming services, running on two hosts. 
I've noticed more than once that while stopping and starting different bundles 
on the two hosts (just playing around with them manually to see how robust the 
distributed system is), at some point one of the hosts doesn't see that a 
service it was using from the other host is down. Connecting to ZooKeeper 
directly, I see the node for that service is still there, i.e. the service was 
not properly removed from ZK even though the bundle is stopped and the service is 
gone.

Investigating this is a bit tricky, since it involves various trackers, 
endpoint listeners and service listeners, and there is not enough code 
documentation to understand what the intended flow is. However, I have a few 
related findings that may point at the solution:

1. Following the logs and some debugging, it appears that the problem is not 
with the discovery.zookeeper package/bundle itself, since the endpoint removed 
event never gets there.

2. In EndpointListenerNotifier.notifyListenersOfRemoval(), the 
EndpointDescription appears to be null, so there is never a filter match and 
the endpointRemoved callback is never triggered on the EndpointListeners. This 
is because all of the ExportRegistrations are already closed by the time they 
get there. It seems that the premature closing is done by the service tracker 
created in ExportRegistrationImpl.startServiceTracker(). My guess is that the 
order in which the service tracker and the service listener (in 
TopologyManagerExport, which triggers the EndpointListenerNotifier) receive the 
events is not deterministic, i.e. there is a race between them, which may 
explain why the bug reproduces only intermittently. I would like to say 
that the solution is to get rid of the service tracker altogether (it doesn't 
do anything else, and as a separate bug, is never closed), but I'm not sure why 
it was introduced in the first place or if there are any other scenarios in 
which it was necessary, so I really don't know what the proper solution should 
be.

3. Another element that may have been masking this bug to some degree is the 
local discovery bundle, which was running; during debugging I saw it trigger 
some EndpointListener removal events which were picked up by the other 
components. I'm not entirely sure yet what this bundle does (I didn't find any 
mention of it on the website, and didn't get to the code yet), so for now I 
just leave it in the stopped state, which has no visible effect on the testing 
and makes debugging easier.

4. An additional related issue which bugged me during a previous code review 
was that InterfaceMonitorManager.addInterest() is closing and recreating an 
InterfaceMonitor every time it is invoked with an existing scope, even though 
the old and new IMs monitor the same ZK node and are practically identical - so 
why not just leave the old monitor running? This replacement causes a bunch of 
unnecessary extra work (including several ZK server accesses), a flurry of 
unnecessary filter-matching logs, and an unnecessary gap in monitoring for ZK 
changes. This also relates to the bug at hand since InterfaceMonitor.close() 
also sends some EndpointListener notifications about the endpoints being 
removed, which leaves some gaps in the registration coverage (before they are 
re-added moments later) and might interact in some other unpredictable (at 
least to me) way with the rest of the mechanism. It seems these IM close/start 
cycles sometimes occur tens of times in a row.
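
To illustrate the ordering problem suspected in item 2, here is a minimal 
sketch in plain Java. All class and method names and the endpoint URL are 
hypothetical stand-ins, not the actual DOSGi code; it only assumes that closing 
a registration makes its endpoint description unavailable, and that the two 
handlers can receive the unregistration event in either order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these are the real DOSGi classes.
class RemovalOrderSketch {

    /** Stand-in for an export registration whose endpoint is lost on close(). */
    static class Registration {
        private String endpoint = "http://host1:9000/greeter"; // made-up endpoint id

        String getEndpoint() {
            return endpoint;
        }

        void close() {
            endpoint = null;
        }
    }

    /**
     * Delivers one "service unregistering" event to two handlers in the given
     * order and returns the endpoint ids that were reported as removed.
     */
    static List<String> unregister(boolean trackerFirst) {
        Registration reg = new Registration();
        List<String> removed = new ArrayList<>();

        // Handler A: stand-in for the per-registration tracker, which closes it.
        Runnable tracker = reg::close;

        // Handler B: stand-in for the removal notifier, which needs the
        // endpoint in order to match listener filters and report the removal.
        Runnable notifier = () -> {
            String ep = reg.getEndpoint();
            if (ep != null) {
                removed.add(ep);
            }
        };

        if (trackerFirst) {
            tracker.run();
            notifier.run();
        } else {
            notifier.run();
            tracker.run();
        }
        return removed;
    }

    public static void main(String[] args) {
        System.out.println("notifier first: " + unregister(false));
        System.out.println("tracker first:  " + unregister(true));
    }
}
```

When the tracker stand-in runs first, the returned list is empty, i.e. no 
removal is ever reported, which is consistent with the stale ZK node I see. 
When the notifier runs first, the endpoint is reported as removed.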

To sum it up, there's definitely a bug occurring. When I tested a bit with 
fixes for both potential causes above (IM stop/start replaced with a single 
start the first time a given scope is encountered, and close invocation in 
service tracker removed) - I could no longer recreate the bug, but I don't 
understand all the component interactions well enough to know if there are any 
side effects, or why they were implemented this way in the first place (I tend 
to assume there was a good reason for it which I'm unaware of).
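
The first of the two fixes mentioned above (start a monitor only the first 
time a scope is encountered) can be sketched roughly as follows. This is not 
the real InterfaceMonitorManager or InterfaceMonitor; the class names, the 
startCount field and the scope strings are assumptions for illustration, and 
the sketch ignores any per-listener bookkeeping the real addInterest() may do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; not the real InterfaceMonitorManager/InterfaceMonitor.
class MonitorReuseSketch {

    /** Stand-in for a monitor watching one ZK node for a given scope. */
    static class Monitor {
        final String scope;

        Monitor(String scope) {
            this.scope = scope;
        }
    }

    private final Map<String, Monitor> monitors = new HashMap<>();

    /** Number of monitors started so far (only here to make reuse visible). */
    int startCount = 0;

    /**
     * Returns the monitor for the scope, creating and starting one only if
     * the scope has not been seen before. An existing monitor is left running
     * instead of being closed and recreated.
     */
    Monitor addInterest(String scope) {
        return monitors.computeIfAbsent(scope, s -> {
            startCount++; // stands in for the real start-up work and ZK accesses
            return new Monitor(s);
        });
    }

    public static void main(String[] args) {
        MonitorReuseSketch mgr = new MonitorReuseSketch();
        Monitor first = mgr.addInterest("(objectClass=org.example.Greeter)");
        Monitor second = mgr.addInterest("(objectClass=org.example.Greeter)");
        System.out.println("same monitor reused: " + (first == second));
        System.out.println("monitors started: " + mgr.startCount);
    }
}
```

With this shape, repeated addInterest() calls for the same scope cause no 
close/start cycle, so there are no intermediate removal notifications and no 
gap in monitoring. Whether that is safe in the real code depends on why the 
replacement was introduced, which I don't know.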


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
