Re: [Openslp-devel] Problem with uncontrolled loss of DAs

Morrell Richard Thu, 20 Sep 2007 04:25:21 -0700

Thanks for the feedback.
 
Ideally, we would like to detect invalid DAs within a minute or so.  I know
that sounds ambitious, but our base networking is Gigabit ethernet, our
backbone is 10Gbit, and our systems have fewer than 30 DAs (although we have
a couple of hundred "slave" SAs), so the additional load of querying for
option B is acceptable (option A has the same number of replies, but fewer
request messages as it would use multicast).  We use very aggressive
timeouts, and our searches usually complete within a few milliseconds anyway
(or a few tens of milliseconds in the worst case).
 
In option A, I had originally thought that the normal multicast algorithm
was used, in which case subsequent messages would include an increasing
responders list.  However, on more detailed inspection of the code, I
realise that active discovery is a single-shot multicast send with no
retransmits, so it is not really suitable for our purposes.
 
I'm not clear how your alternate suggestion would work.  My understanding
from looking at the code is that the library sends registration requests
only to the local SLP daemon.  The comments for NetworkConnectToSA suggest
that the cached socket, handle->sasock, can be connected directly to a
DA/SA, but the only place I can find where handle->sasock is set up is a
call to NetworkConnectToSlpd.  As I understand it, the local SLP daemon
will also reject registrations that are not in its scope.
 
I have also had a thought about an option C.  If each DA periodically sent
out a multicast DAAdvert (a heartbeat) at a rate of, say, three times the
inactive DA check rate, the daemon could remember whether it had received an
advert for each known DA since the last check, and remove those DAs it
hadn't heard from.  This would involve the least additional traffic, and
probably the least additional load on the daemons.  The risk of losing a DA
incorrectly would be higher than with the unicast option (which does up to
five retransmits), though it would be similar if the heartbeat rate were
five times the check rate.  This mechanism would also be likely to recover
from an
incorrect removal more quickly than option B, which would have to rely on
active DA discovery to re-find the DA (although this could probably be
tuned).  The periodic sending would be dependent on having a configured
check period, which would default to disabled.
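
As a rough illustration of option C, here is a minimal Python sketch (a
simulation only, not OpenSLP code).  The names HeartbeatTracker,
on_da_advert, check and false_removal_probability are hypothetical, and the
probability helper assumes that heartbeat losses are independent.

```python
class HeartbeatTracker:
    """Tracks which known DAs have advertised since the last check."""

    def __init__(self, known_das):
        # Map DA URL -> True if a DAAdvert arrived since the last check.
        self.heard = {da: False for da in known_das}

    def on_da_advert(self, da_url):
        # A heartbeat from a DA (known or newly seen) marks it as alive.
        self.heard[da_url] = True

    def check(self):
        """Run one inactive-DA check and return the DAs that were removed."""
        removed = [da for da, seen in self.heard.items() if not seen]
        for da in removed:
            del self.heard[da]
        # Reset the flags so the next check interval starts fresh.
        for da in self.heard:
            self.heard[da] = False
        return removed


def false_removal_probability(loss_rate, heartbeats_per_check):
    # Assumption: independent packet loss.  A live DA is removed only if
    # every heartbeat sent during one check interval is lost.
    return loss_rate ** heartbeats_per_check
```

Under that independence assumption, a 10% loss rate with three heartbeats
per check interval gives a false removal probability of about 0.001 per
check, and a wrongly removed DA is re-added by its next heartbeat.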
 
Thinking about it, this option seems to have a lot going for it.  What do
you think?
 
--Richard

-----Original Message-----
From: Nick Wagner [mailto:[EMAIL PROTECTED]
Sent: 19 September 2007 17:47
To: Morrell Richard
Cc: [email protected]
Subject: Re: [Openslp-devel] Problem with uncontrolled loss of DAs

In my systems everyone is on the same scope, so I haven't run into this
problem (which is why I would prefer that any added mechanism be disabled
by default).  I'm a little curious as to how often multiple scopes are
actually used, and are used in the same manner as your system.  What kind of
time period do you need to detect invalid DAs in? 

You are correct that the issue here is not just a FindScopes one, it's the
fact that DAs don't expire in slpd.  I ran into the same issue when moving
slpd unicast to UDP, which is why I added the timeout on the service
registration (following the protocol, of course :).  If FindScopes were a
protocol-level command, I'd suggest a similar solution, but FindScopes just
queries the internal database as given to libslp by the connected slpd.  

As an alternative to either option, the app could register a fake
registration on each scope it knows about through a previous SLPFindScopes,
which should help keep the knownDAs in sync, and OpenSLP would not need to
be changed.  If multiple scopes aren't widely used, or aren't used in the
way you use them, this may be the preferred option.  Or it could act as a
quick proof of concept.

I'm a little worried about removing the answer suppression in option A.  You
are never guaranteed to receive a response from a DA in a particular
FindSrvs, and if there are a lot of DAs on the system the likelihood of
seeing that DA could decrease because you are processing all the DA adverts
on each request.  I'm also assuming you would keep some sort of list of
potential drops, rather than dropping a DA as soon as it fails to respond to
one request.
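
To illustrate the "list of potential drops" idea, here is a minimal Python
sketch (a simulation only, not OpenSLP code).  The class name MissCounter
and the max_misses threshold are hypothetical; a DA is dropped only after it
has failed to answer several consecutive discovery rounds.

```python
class MissCounter:
    """Counts consecutive discovery rounds in which each DA stayed silent."""

    def __init__(self, known_das, max_misses=3):
        # max_misses is a hypothetical tunable, not an OpenSLP property.
        self.max_misses = max_misses
        self.misses = {da: 0 for da in known_das}

    def record_round(self, responders):
        """Process one discovery round and return the DAs dropped by it."""
        responders = set(responders)
        dropped = []
        for da in list(self.misses):
            if da in responders:
                # Any reply clears the DA's record of missed rounds.
                self.misses[da] = 0
            else:
                self.misses[da] += 1
                if self.misses[da] >= self.max_misses:
                    dropped.append(da)
                    del self.misses[da]
        # DAs seen for the first time start with a clean record.
        for da in responders:
            self.misses.setdefault(da, 0)
        return dropped
```

A single missed reply therefore only increments a counter, which guards
against dropping a live DA whose advert was lost in one round.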

Option B has potential.  Slpd could periodically do a unicast request and
time out in the same way that registration requests currently time out.  If
this period weren't too small there wouldn't be that much of an impact on the
system (and if disabled by default there would be even less :).  Of the two
options, I think I prefer this one.

Just my two cents.

--Nick

On 9/19/07, Morrell Richard <[EMAIL PROTECTED]> wrote:

I have a problem with uncontrolled loss of DAs, i.e. where DAs drop off the
network without sending out a corresponding DA advert, for example because
of power loss or network device failure.

All the DAs in our system have unique scopes, and we perform unicast
searches of each scope (I have a patch to the 1.2.1 library that does
parallel unicast to multiple DAs, which I haven't yet had time to port to 
the latest trunk for submission).

We get the list of scopes using the SLPFindScopes call, which queries the
local daemon.  The problem is that when a DA goes down in an uncontrolled
fashion, its scope never seems to get removed from the list of scopes, so we
get timeouts for all subsequent searches until the DA comes back up again,
which is unacceptable in our application (we can cope with a short period
where this occurs, provided the situation is not permanent).

We have tried setting the active DA discovery parameters to their most
aggressive, in the hope that this would flag up the lost DAs, but this makes
no difference.

I have looked at the code, both version 1.2.1 which we are using, and the
latest trunk, and I believe the problem is in both, and arises because the
active DA discovery only adds new DAs to the DA cache, and does not remove
them.  DAs ARE removed from the cache if a unicast request to a DA fails,
but such requests seem to be related only to service registration and
deregistration and, since each DA has a unique scope, there is no
requirement to perform these operations between DAs.

The two approaches I was considering were:

a) Change the active discovery mechanism to query for all DAs (use an empty
previous responders list), and construct a list of those known DAs that
don't reply, removing them from the cache after a time.  This behaviour 
could be enabled/disabled using a new property.

b) Perform a regular unicast (DA request?) to each of the known DAs, e.g. on
a round-robin basis, so that all DAs are polled within a time period
controlled by a new property (which could be set to zero to disable this
behaviour).
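
Approach b) could be sketched roughly as follows in Python (a simulation
only, not OpenSLP code).  The class RoundRobinPoller, the period_seconds
parameter and the probe callback are hypothetical stand-ins for the new
property and the unicast request; probe is assumed to return False once all
of its retransmits have timed out.

```python
class RoundRobinPoller:
    """Probes one known DA per tick so all are covered within one period."""

    def __init__(self, known_das, period_seconds, probe):
        self.known_das = list(known_das)
        self.period = period_seconds  # zero (or less) disables polling
        self.probe = probe            # callable: da -> True if DA answered
        self.next_index = 0

    def interval(self):
        # Spacing between probes so every DA is probed once per period.
        if self.period <= 0 or not self.known_das:
            return None
        return self.period / len(self.known_das)

    def poll_next(self):
        """Probe the next DA; return it if it was removed, else None."""
        if self.period <= 0 or not self.known_das:
            return None
        self.next_index %= len(self.known_das)
        da = self.known_das[self.next_index]
        if self.probe(da):
            self.next_index += 1
            return None
        # Probe failed: drop the DA.  The index now already points at the
        # DA that followed it, so it is not advanced.
        del self.known_das[self.next_index]
        return da
```

With fewer than 30 DAs and a 60 second period, this works out to one
unicast probe every two seconds or so, which spreads the load evenly
instead of sending a burst of requests once per period.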

Obviously, I would like to feed any changes back into the project, so I am
looking for feedback as to which approach would be preferable, or if there
is another approach that would be better, or if someone else was working on 
the problem already.

Thanks.

Richard Morrell

Software Architecture & Technologies
THALES UNDERWATER SYSTEMS LTD

This email, including any attachment, is a confidential communication 
intended solely for the use of the individual or entity to whom it is
addressed. It contains information which is private and may be proprietary
or covered by legal professional privilege. If you have received this email 
in error, please notify the sender upon receipt, and immediately delete it
from your system.

Anything contained in this email that is not connected with the businesses
of this company is neither endorsed by nor is the liability of this company.

Whilst we have taken reasonable precautions to ensure that any attachment to
this email has been swept for viruses, we cannot accept liability for any
damage sustained as a result of software viruses, and would advise that you 
carry out your own virus checks before opening any attachment.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005. 
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Openslp-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openslp-devel
