Amichai Rothman created DOSGI-173:
-------------------------------------
Summary: unregistering an exported service does not remove it from
zookeeper (and remote clients)
Key: DOSGI-173
URL: https://issues.apache.org/jira/browse/DOSGI-173
Project: CXF Distributed OSGi
Issue Type: Bug
Affects Versions: 1.5
Reporter: Amichai Rothman
I have some bundles exporting and consuming services, running on two hosts.
I've noticed more than once that while stopping and starting different bundles
on the two hosts (just playing around with them manually to see how robust the
distributed system is), at some point one of the hosts doesn't see that a
service it was using from the other host is down. Connecting to ZooKeeper
directly, I see the node for that service is still there, i.e. the service was
not properly removed from ZK even though the bundle is stopped and the service
is gone.
Investigating this is a bit tricky, since it involves various trackers,
endpoint listeners and service listeners, and there is not enough code
documentation to understand what the intended flow is. However, I have a few
related findings that may point to the solution:
1. Following the logs and some debugging, it appears that the problem is not
with the discovery.zookeeper package/bundle itself, since the endpoint removed
event never gets there.
2. In EndpointListenerNotifier.notifyListenersOfRemoval(), the
EndpointDescription appears to be null, so there is never a filter match and
the endpointRemoved callback is never triggered on the EndpointListeners. This
is because all of the ExportRegistrations are already closed by the time they
get there. It seems that the premature closing is done by the service tracker
created in ExportRegistrationImpl.startServiceTracker(). My guess is that the
order in which the service tracker and the service listener (in
TopologyManagerExport, which triggers the EndpointListenerNotifier) receive the
unregistration event is not deterministic, which would explain why the bug
reproduces only intermittently. I would like to say
that the solution is to get rid of the service tracker altogether (it doesn't
do anything else, and as a separate bug, is never closed), but I'm not sure why
it was introduced in the first place or if there are any other scenarios in
which it was necessary, so I really don't know what the proper solution should
be.
3. Another element that may have been masking this bug to some degree is the
local discovery bundle which was running, and during debugging I saw it
triggering some EndpointListener removal events which were picked up by the
other components. I'm not entirely sure yet of what this bundle does (I didn't
find any mention of it on the website, and didn't get to the code yet), but I
just leave this bundle in the stopped state for now, with no visible effects on
the testing, making debugging easier.
4. An additional related issue which bugged me during a previous code review
was that InterfaceMonitorManager.addInterest() is closing and recreating an
InterfaceMonitor every time it is invoked with an existing scope, even though
the old and new IMs monitor the same ZK node and are practically identical - so
why not just leave the old monitor running? This replacement causes a bunch of
unnecessary extra work (including several ZK server accesses), a flurry of
unnecessary filter-matching logs, and an unnecessary gap in monitoring for ZK
changes. This also relates to the bug at hand since InterfaceMonitor.close()
also sends some EndpointListener notifications about the endpoints being
removed, which leaves some gaps in the registration coverage (before they are
re-added moments later) and might interact in some other unpredictable (at
least to me) way with the rest of the mechanism. It seems these IM close/start
cycles sometimes occur tens of times in a row.
To sum it up, there's definitely a bug here. When I tested with fixes for both
potential causes above (IM stop/start replaced with a single start the first
time a given scope is encountered, and the close invocation in the service
tracker removed), I could no longer reproduce the bug. However, I don't
understand all the component interactions well enough to know whether there are
any side effects, or why they were implemented this way in the first place (I
tend to assume there was a good reason for it that I'm unaware of).
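For reference, the "single start the first time a given scope is encountered"
idea can be sketched as follows. This is a simplified model, not the actual
InterfaceMonitorManager: Monitor, MonitorManager and the starts counter are
hypothetical names, the sketch ignores whatever per-listener bookkeeping the
real class does, and it is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins; none of these are the actual DOSGi classes.
class MonitorReuseSketch {

    /** Stand-in for a monitor watching the ZooKeeper node of one scope. */
    static final class Monitor {
        final String scope;
        boolean running;

        Monitor(String scope) {
            this.scope = scope;
        }

        void start() {
            running = true;
        }
    }

    static final class MonitorManager {
        private final Map<String, Monitor> monitors = new HashMap<>();
        int starts; // how many monitors were actually created and started

        /** Starts a monitor only the first time a scope is seen; later calls reuse it. */
        Monitor addInterest(String scope) {
            return monitors.computeIfAbsent(scope, s -> {
                Monitor monitor = new Monitor(s);
                monitor.start();
                starts++;
                return monitor;
            });
        }
    }

    public static void main(String[] args) {
        MonitorManager manager = new MonitorManager();
        Monitor first = manager.addInterest("(objectClass=org.example.Foo)");
        Monitor second = manager.addInterest("(objectClass=org.example.Foo)");
        System.out.println(first == second); // true: the existing monitor is reused
        System.out.println(manager.starts);  // 1: no close/start cycle on the second call
    }
}
```

Because the existing monitor is never closed on a repeated call, this variant
sends no intermediate removal notifications and leaves no gap in monitoring.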