Re: Reliability of RPC services

Jonathan S. Shapiro Fri, 21 Apr 2006 17:16:56 -0700

Marcus raises a good point, but he is missing some history.

In KeyKOS, the "resume capability" served the role of a reply
capability. When a node containing a resume capability was destroyed (by
the space bank), a well-defined, distinguished message was sent to the
recipient for all resume capabilities contained in that node.


In EROS, this behavior was dropped, and in my opinion that was a
mistake.

In Coyotos, there is currently no way to distinguish resume capabilities
from entry capabilities, so it is difficult to duplicate the KeyKOS
behavior, but see below.

In any persistent system, "notify on last capability drop" is
impractical. It requires disk garbage collection, so the delay is too
long to be helpful.

It would not be difficult to add a bit to an FCRB capability to support
something close to the KeyKOS behavior. We could call it the "invoke on
delete" bit.

Here is the meaning of this bit:

  On destruction of any object, the capability slots are examined.
  For any slot that contains an "invoke on delete" FCRB sender
  capability, a non-blocking message will be sent indicating that
  the capability was held within a deleted object at the time of
  deletion.

  If sending this message would block, it will not be delivered.

  If the FCRB sender capability is invalid, no message will be sent.
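
To make the three rules above concrete, here is a rough sketch in
Python. It is only a toy model, not Coyotos code: the names Fcrb,
SenderCap, Obj, try_send and destroy are all hypothetical, and a bounded
inbox stands in for "sending would block".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Fcrb:
    """Hypothetical stand-in for an FCRB: a bounded inbox of messages."""
    capacity: int = 1
    inbox: deque = field(default_factory=deque)

    def try_send(self, msg):
        # Non-blocking send: if delivery would block, drop the message.
        if len(self.inbox) >= self.capacity:
            return False
        self.inbox.append(msg)
        return True


@dataclass
class SenderCap:
    """Hypothetical FCRB sender capability carrying the proposed bit."""
    fcrb: Fcrb
    invoke_on_delete: bool = False
    valid: bool = True


@dataclass
class Obj:
    """Any object that holds capability slots."""
    name: str
    slots: list = field(default_factory=list)


def destroy(obj):
    """Examine the slots of a dying object; return the messages delivered."""
    delivered = 0
    for cap in obj.slots:
        if not isinstance(cap, SenderCap) or not cap.invoke_on_delete:
            continue  # only "invoke on delete" sender capabilities count
        if not cap.valid:
            continue  # invalid capability: no message is sent
        if cap.fcrb.try_send(("holder-deleted", obj.name)):
            delivered += 1  # otherwise the send would block and is dropped
    obj.slots.clear()
    return delivered
```

Note that destroy never waits on the receiver: a full inbox simply means
the notice is lost, which is the price of keeping object destruction
non-blocking.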

HOWEVER:

This message does NOT mean that all capabilities to the FCRB are gone.
It means that *some* object containing the capability has been
destroyed. If there are multiple copies of the capability in different
objects, and one of these objects is destroyed, the message will be
sent. Programs can take deliberate steps to suppress this behavior, but
this would be the normal outcome.
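
A small sketch of this caveat, again as a toy Python model with
hypothetical names (Receiver, destroy_holder, and a dict recording which
objects hold a copy with the bit set). It only illustrates that a notice
can arrive while other copies are still live.

```python
class Receiver:
    """Hypothetical receiver that records the notices it is sent."""

    def __init__(self):
        self.notices = []


def destroy_holder(name, holders, receiver):
    """Destroy one holding object; return how many copies remain live.

    `holders` maps an object name to True if that object holds an
    "invoke on delete" copy of the capability.
    """
    if holders.pop(name):
        receiver.notices.append(name)  # sent even if other copies exist
    return sum(holders.values())


holders = {"client": True, "backup": True}
receiver = Receiver()
remaining = destroy_holder("client", holders, receiver)
# A notice was delivered, yet one copy is still held by "backup".
```

So the receiver has to treat the message as "one holder went away", not
as "nobody can reach me any more".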

This is not quite the semantics that Marcus is after, but in practice it
was good enough in KeyKOS.

If this is sufficiently helpful to justify revising the Coyotos spec,
please send a note to coyotos-dev confirming that this update should be
made.

> * Whatever user program destroys the failed server process D, also
>   takes care of the users of the process D.  This solution requires
>   significant structural overhead, and creates undesirable strong
>   dependency structures in the system (for example, global managers).

This solution is impossible. The storage containing those capabilities
is gone, and the party who destroys that storage usually does not have
access to the content of the storage. In particular, the authority to
destroy a space bank specifically does NOT include the authority to
inspect storage that has been allocated by that bank. This is absolutely
essential for confinement and *any* security policy.

> * The program S could use timeouts in the call to D.  This solution
>   requires significant structural changes to the system design,
>   because now time becomes an important parameter in evaluating
>   services.  It can be tried to argue that this is desirable anyway.

This solution leads directly to systems that fail under load. There is
general agreement in both the L4 and EROS/Coyotos communities that
generalized timeouts were a mistake, and that "forever" and "don't wait"
are the only options that should be implemented by the IPC layer.

> * Following Mach, special "send-once" capabilities are introduced that
>   implement the send-once semantics.  Here are the semantics expressed
>   in terms of Coyotos: When copied, the source capability is
>   invalidated (so the number of send-once capabilities to a given
>   object is a system invariant under capability copy operations). 

The semantics of send-once rights are an abomination. Their cost is
considerable, and the overhead of manipulating them correctly from the
application perspective is a serious problem. Coyotos will not under any
circumstances implement "send-once" or "grant-only" capabilities.

>   This has the disadvantage that it makes task destruction somewhat
>   more expensive...

Worse, it has the disadvantage that every capability copy must be
preceded by a capability type check, so that the sender knows whether it
is losing the capability as a side effect. This violates encapsulation
in a fairly fundamental way.
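
To illustrate the type-check problem, here is a toy Python model of the
copy semantics Marcus describes. Cap, copy_cap and forward are
hypothetical names, not any real interface.

```python
class Cap:
    """Hypothetical capability; `send_once` marks the Mach-style variant."""

    def __init__(self, target, send_once=False):
        self.target = target
        self.send_once = send_once
        self.valid = True


def copy_cap(src):
    """Copy a capability; a send-once source is invalidated by the copy."""
    if not src.valid:
        raise ValueError("cannot copy an invalid capability")
    dup = Cap(src.target, src.send_once)
    if src.send_once:
        src.valid = False  # the copy silently behaves like a move
    return dup


def forward(cap):
    """A sender that wants to know whether it still holds `cap` afterwards."""
    will_lose = cap.send_once  # the type check forced on every sender
    dup = copy_cap(cap)
    return dup, will_lose
```

Every caller that cares whether it still holds the capability after a
copy ends up writing something like forward, which means inspecting the
capability's type before an operation that is otherwise type-agnostic.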

> 1) Is RPC robustness desirable/required, or is an alternative model
>    feasible where machine-local RPC is as unreliable as IP/UDP network
>    communication?

Yes, RPC robustness is important.

> 2) If it is indeed desirable, are there more possible solutions than
>    the three approaches described above?

Yes: the "invoke on delete" bit described above, that is, invoke on
destruction of the containing object.

> 3) Are the costs of destroying send-once rights (and thus sending
>    messages) acceptable?  Given a positive answer to 1, and a negative
>    answer to 2, are these costs in fact unavoidable?

The costs are high and the semantics are horrible.

> 4) In fact, if we consider persistence, can not the same mechanism
>    above that was described to help with malicious or buggy software
>    be used to deal with the planned and desired removal of device
>    driver servers from the system at reboot of the persistent machine?

This is a completely different problem, and it is handled by the normal
revocation logic.

>    IOW: As far as I understand, EROS had a logic to restart RPCs that
>    were pending and which were sent across the boundary between the
>    persistent and the non-persistent world.

It did not. The design called for a mechanism by which both parties
would discover that the RPC was incomplete because the communication
channel had been destroyed.

shap



_______________________________________________
L4-hurd mailing list
L4-hurd@gnu.org
http://lists.gnu.org/mailman/listinfo/l4-hurd
