Rao Shoaib writes:
> > Section 2.2 says:
> >
> >   "An up call/down call should never result in a up/down call in the
> >    same thread context."
> >
> >   I'm a little unclear on the 'should' part of this statement.  Is it
> >   intended to mean 'shall' or are there exceptions to the statement?
> >
> >   I don't think I see cases where this ought to occur as a matter of
> >   design, so am I missing something?
> >   
> Actually, this statement is not correct.  There are two cases where an
> upcall can result in a downcall (or vice versa) in the same thread:
> 1) In case of fallback.
> 2) If the protocol gets flow-controlled on the send, the thread making
> the downcall makes the upcall to set the TxQ full flag.

If reentry is possible, then I'd recommend:

  - Document clearly which callbacks can reenter (and which cannot),
    and which reentry points are valid.

  - Be as strict as you can.  It's always possible to add new
    permissions to an interface; it's almost never possible to revoke
    ones we've given, even when we later discover that they're
    problematic.

Tread carefully; this is an area where we've really hurt ourselves in
the past by recursing to blow stacks and getting tangled up in knots
over locks.  (A good example of how bad it can get is SQS_REENTER.)
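
One way to keep those rules honest is to have the socket layer track
which thread is currently in an upcall and assert when a documented
non-reentrant path is re-entered.  A rough debug-only sketch (every name
below is invented, and it assumes at most one thread in an upcall per
socket at a time):

    #include <sys/types.h>
    #include <sys/thread.h>
    #include <sys/debug.h>

    /* Hypothetical per-socket debug state; not part of the proposed API. */
    typedef struct so_debug {
        kthread_t *sod_upcall_owner;    /* thread currently in an upcall */
    } so_debug_t;

    static void
    so_upcall_enter(so_debug_t *sod)
    {
        /* Same-thread reentry into an upcall would violate the rule. */
        ASSERT(sod->sod_upcall_owner != curthread);
        sod->sod_upcall_owner = curthread;
    }

    static void
    so_upcall_exit(so_debug_t *sod)
    {
        ASSERT(sod->sod_upcall_owner == curthread);
        sod->sod_upcall_owner = NULL;
    }

    /* Downcalls documented as non-reentrant can make the same check. */
    static void
    so_downcall_check(so_debug_t *sod)
    {
        ASSERT(sod->sod_upcall_owner != curthread);
    }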

> > Section 2.7:
> >
> >   In addition to "filtering," I expect that there will be layered
> >   modules, so we should be planning for them.  "Sockets Direct" would
> >   be one example of a layered model.  NL7C and SSL are probably other
> >   examples of these.  (The actual implementation probably looks a lot
> >   like "filtering," so perhaps this is just a nit.)
> >   
> What terminology should we use here instead?

"Filtering" just seemed like a very specific term to use, with certain
connotations about functionality.  Maybe giving examples of what could
be termed a "filter" (would NL7C be one?) would help.

Otherwise, explaining that you intend to allow for Volo plug-ins that
can in turn rely on other Volo plug-ins would do it.

> > Section 3.2.1:
> >
> >   "If a module name is present, checks are made to ensure that the
> >    module exists and can be loaded."
> >
> >   Given that the rest of the design supports dynamic loading, I
> >   suggest considering not doing this test here, but instead do it when
> >   a socket is opened, and the module is actually needed.  That way,
> >   designers could put new socket types in "/usr/kernel/sockmods" and
> >   avoid burdening the root file system with more modules.  (I suspect
> >   you already have to do that dynamic existence check *anyway*, as
> >   it's part of dynamic loading.)
> >   
> Isn't /usr already mounted when soconfig runs?

No.

It runs from /etc/inittab as the system is booting.  It runs even
before svc.startd runs, which is what starts up SMF and (in turn)
causes the scripts that mount /usr to run.

That's why soconfig is in /sbin.
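
To be concrete about the alternative I'm suggesting: defer the existence
check to first use and just attempt the load then.  A hedged sketch using
ddi_modopen(9F); the function name and the "sockmod" subdirectory below
are made up:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/modctl.h>
    #include <sys/sunddi.h>

    /*
     * Hypothetical: resolve a socket module only when a socket of that
     * type is first created, so modules in /usr/kernel/sockmods work
     * even though soconfig ran long before /usr was mounted.
     */
    static int
    sockmod_hold(const char *modname, ddi_modhandle_t *hdlp)
    {
        char path[MAXPATHLEN];
        int error = 0;

        /* "sockmod/<name>"; the subdirectory name here is a guess. */
        (void) snprintf(path, sizeof (path), "sockmod/%s", modname);

        *hdlp = ddi_modopen(path, KRTLD_MODE_FIRST, &error);
        if (*hdlp == NULL)
            return (error != 0 ? error : ENXIO);
        return (0);
    }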

> > Section 3.4.1:
> >
> >   Depending on how many of these there are, and how frequent
> >   create/destroy is done, a modhash might be useful.
> >   
> We will file an RFE.

How often do applications open sockets?  Deferring this to an RFE would
make sense if socket creation were an infrequent operation, and one that
wasn't associated with specific performance tests (like web server
connection counts), but I don't think that's the case here.
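
For reference, by "modhash" I mean something as simple as an id-keyed
mod_hash over (family, type, protocol).  Every name in this sketch is
invented:

    #include <sys/types.h>
    #include <sys/modhash.h>

    static mod_hash_t *sockparams_hash;

    #define SOCKPARAMS_KEY(fam, type, pro) \
        ((mod_hash_key_t)(uintptr_t)(((fam) << 16) | ((type) << 8) | (pro)))

    static void
    sockparams_hash_init(void)
    {
        sockparams_hash = mod_hash_create_idhash("sockparams_hash",
            64, mod_hash_null_valdtor);
    }

    static int
    sockparams_hash_add(int fam, int type, int pro, void *sp)
    {
        return (mod_hash_insert(sockparams_hash,
            SOCKPARAMS_KEY(fam, type, pro), (mod_hash_val_t)sp));
    }

    /* Called on every socket(3SOCKET) create; O(1) instead of a list walk. */
    static void *
    sockparams_hash_find(int fam, int type, int pro)
    {
        mod_hash_val_t val;

        if (mod_hash_find(sockparams_hash,
            SOCKPARAMS_KEY(fam, type, pro), &val) != 0)
            return (NULL);
        return ((void *)val);
    }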

> > Section 3.6:
> >
> >   Once unload has actually started to run, I suspect that something
> >   will need to prevent new calls to the module from occurring until it
> >   can be loaded again.  Unload should start the _fini process, and at
> >   some point (particularly after mod_remove has been called), this
> >   becomes destructive.
> >
> >   How does TCP's time-wait state work here?  It needs to hang around
> >   for a while after the socket is actually closed.  What stops the
> >   module from unloading in this case?  (Does the transport have to
> >   handle this?)
> >   
> We had not thought about this. Currently tcp and the sockmod for tcp are
> separate modules, so we only unload the socket module. However, they
> could be the same, so we need to handle this situation. We will handle
> it by having the protocol return EAGAIN or EBUSY. In the EAGAIN case the
> unload will be retried at a later time; EBUSY would mean that the module
> is busy.

OK, but I don't think I understand the distinction between the two.
If a module can't be unloaded right now, isn't it busy?  And if it's
busy, it can't be unloaded, can it?  I would have expected that those
are the same state.
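
Put another way, I'd expect the unload path to look like any other
module's _fini, where the only "can't do it now" answer is EBUSY and the
framework simply retries later.  A hedged sketch (the reference count is
hypothetical):

    #include <sys/types.h>
    #include <sys/errno.h>
    #include <sys/modctl.h>

    extern struct modlinkage modlinkage;    /* the module's usual linkage */

    /* Hypothetical count of endpoints (including TIME_WAIT) still using us. */
    static volatile uint_t sockmod_refcnt;

    int
    _fini(void)
    {
        /*
         * If anything still references the module -- an open socket or a
         * TCP endpoint lingering in TIME_WAIT -- refuse the unload.  The
         * caller tries again later; there is no separate "EAGAIN vs.
         * EBUSY" state to distinguish.
         */
        if (sockmod_refcnt != 0)
            return (EBUSY);

        return (mod_remove(&modlinkage));
    }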

> > Section 4.4.3:
> >
> >   What context is required or provided when the downcalls are made?
> >   Is it always a process context, or could it be a kernel thread or
> >   even an interrupt?
> >   
> It can never be an interrupt context; it will be either process or
> kernel thread context.

OK; that needs to be documented.

> > Section 4.4.3.13:
> >
> >   How do socket types that require synchronous behavior on
> >   read(2)/write(2) work (such as with PF_ROUTE)?
> >   
> write(2) is blocked in the protocol.  Nothing special is done for read(2).

By "nothing special," I think you mean to say that it's assumed that
read(2) is async, so you block internally (without calling the
protocol code) if nothing is in the queue, and you expect the protocol
to make upcalls to deliver data to be queued for the application.

That's probably ok for now; I have to admit that I'm not really sure.
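
In other words, something along these lines (purely illustrative; the
structure and function names are mine, not the document's):

    #include <sys/ksynch.h>
    #include <sys/stream.h>

    /* Invented receive-queue state, for illustration only. */
    typedef struct so_rcv {
        kmutex_t sor_lock;
        kcondvar_t sor_cv;
        mblk_t *sor_head;       /* data queued by the protocol's upcalls */
        mblk_t *sor_tail;
    } so_rcv_t;

    /* read(2) path: block in the socket layer; never call down for data. */
    static mblk_t *
    so_dequeue_wait(so_rcv_t *sor)
    {
        mblk_t *mp;

        mutex_enter(&sor->sor_lock);
        while ((mp = sor->sor_head) == NULL) {
            if (cv_wait_sig(&sor->sor_cv, &sor->sor_lock) == 0) {
                mutex_exit(&sor->sor_lock);
                return (NULL);          /* interrupted; caller returns EINTR */
            }
        }
        if ((sor->sor_head = mp->b_next) == NULL)
            sor->sor_tail = NULL;
        mp->b_next = NULL;
        mutex_exit(&sor->sor_lock);
        return (mp);
    }

    /* Protocol's "data arrived" upcall: queue it and wake any reader. */
    static void
    so_enqueue(so_rcv_t *sor, mblk_t *mp)
    {
        mutex_enter(&sor->sor_lock);
        if (sor->sor_tail == NULL)
            sor->sor_head = mp;
        else
            sor->sor_tail->b_next = mp;
        sor->sor_tail = mp;
        cv_signal(&sor->sor_cv);
        mutex_exit(&sor->sor_lock);
    }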

> >   I don't think an upcall to stop flow is a good idea because it
> >   potentially allows for priority-related problems where the sender
> >   pounds data down into a transport layer that never gets a chance to
> >   say "uncle."
> >   
> This mechanism allows us to not copy data if the protocol is flow 
> controlled,

I don't understand why you'd _ever_ have to copy any data.

I'm asking you to return (synchronously; as a return value on a write
attempt) an "I'm full" error from the protocol, and then later do an
upcall from the protocol/transport layer to reenable once space
becomes available.  That scheme doesn't require any data copy at all.
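
Concretely, the whole scheme is just this (all names are invented;
EWOULDBLOCK is an arbitrary choice for the "I'm full" indication, and
locking is elided):

    #include <sys/types.h>
    #include <sys/stream.h>
    #include <sys/errno.h>

    /* Invented protocol-side state, for illustration only. */
    typedef struct proto_state {
        boolean_t ps_flow_controlled;
        void *ps_sock_handle;               /* opaque socket-layer handle */
        void (*ps_tx_enable)(void *);       /* upcall: "space available" */
    } proto_state_t;

    /*
     * Downcall: the socket layer hands a message to the protocol.  The
     * protocol either takes the whole thing or refuses it outright; no
     * data is copied in either case.
     */
    static int
    proto_send(proto_state_t *ps, mblk_t *mp)
    {
        if (ps->ps_flow_controlled)
            return (EWOULDBLOCK);   /* caller still owns mp, untouched */

        /* ... queue mp for transmission ... */
        return (0);
    }

    /*
     * Later, from the protocol's own context, once space frees up: a
     * single upcall re-enables the sender.  The socket layer then wakes
     * blocked writers or notifies poll/select.
     */
    static void
    proto_tx_space(proto_state_t *ps)
    {
        ps->ps_flow_controlled = B_FALSE;
        ps->ps_tx_enable(ps->ps_sock_handle);
    }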

> It also allows us to co-ordinate blocking and waking up of a thread.
> We can address your comment by allowing the protocol to return EAGAIN
> and having the socket code either block or return the error to the user.
> BTW, this will be a change in behavior from the current Solaris
> implementation.

I don't quite understand what you're saying here.

The "current Solaris implementation" is STREAMS, which uses
canputnext() to check whether the next level down is in a flow control
condition.  That's logically equivalent to returning an "I'm full"
error on an attempted write -- you can transform one scheme to the
other.

My point here is that relying on upcalls to stop output is a dangerous
concept.  It means that in a moment when the system is very busy, the
protocol down below somehow has to gain control of a CPU and say
"stop," or it will be overrun with data, with no clear way to stop it.
It invites error, and it's just safer to avoid this case.
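
The transform really is mechanical.  Reusing the invented proto_send()
from the sketch above, the two schemes look like this side by side:

    #include <sys/stream.h>
    #include <sys/errno.h>

    /* STREAMS today: ask before putting. */
    static int
    xmit_streams(queue_t *q, mblk_t *mp)
    {
        if (!canputnext(q))
            return (EWOULDBLOCK);       /* flow controlled; back off */
        putnext(q, mp);
        return (0);
    }

    /*
     * Direct-call equivalent: the protocol answers the same question
     * synchronously via its return value, instead of via canputnext().
     */
    static int
    xmit_direct(proto_state_t *ps, mblk_t *mp)
    {
        return (proto_send(ps, mp));    /* 0, or EWOULDBLOCK when full */
    }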

> > Section 4.4.3.16:
> >
> >   What's the difference between *rvalp and the actual return value
> >   from sd_ioctl?
> >   
> The return value is zero or errno.

Could you please document some of the usage cases to make this
clearer?  This document (and not necessarily the source code) will be
what we give to protocol implementors in order to tell them how to use
this interface; it needs to be clear.

What would I do in the case where I'm returning success?  At a guess,
I'd set *rvalp to anything other than -1, and return 0.  For error,
I'd set *rvalp to -1 and return a non-zero value to be returned to the
user as errno.  Is that right?
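
In code form, my guess at the convention is below; the prototype is
invented to resemble other Solaris entry points, which is exactly the
sort of thing the document needs to pin down:

    #include <sys/types.h>
    #include <sys/cred.h>
    #include <sys/errno.h>

    /* Hypothetical sd_ioctl implementation showing the guessed convention. */
    static int
    proto_sd_ioctl(void *handle, int cmd, intptr_t arg, int mode,
        cred_t *cr, int32_t *rvalp)
    {
        switch (cmd) {
        case 0:     /* stand-in for some command the protocol understands */
            *rvalp = 0;         /* guess: becomes ioctl(2)'s return, >= 0 */
            return (0);         /* success */
        default:
            *rvalp = -1;        /* guess: error indication */
            return (EINVAL);    /* handed back to the user as errno */
        }
    }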

> > Section 4.4.3.17:
> >
> >   Does sd_close block?  If not, then how does the user request a
> >   blocking close or wait for close to complete?
> >   
> It blocks.

OK; that needs to be in the documentation.

> > Section 4.4.4:
> >
> >   The text says "although not recommended" many times in reference to
> >   locks.  Is it possible to make a definite design-level statement
> >   about the holding of locks across the API calls?  I'd strongly
> >   prefer to have the text say "must not."
> >
> >   It's always easier to make things more generous in the future than
> >   it is to try to tighten up the interface when we discover that we've
> >   made a mistake.
> >
> >   Don't be generous in version 1.0.
> >   
> We have to hold locks or be in the squeue, for example to maintain
> ordering of packets.

That's extremely dangerous.  It means that if someone writes a
"filter" that ends up calling downwards on an upcall (say for example
a "TCP splice" module), then we can run into deadlock.

If it's at all possible, it would be a good idea to find a way to
preserve data ordering without allowing locks to be held across calls
that may transfer control through multiple layers.
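
To make the failure mode concrete (everything named here is invented,
and mutex_init() is omitted), a "TCP splice" style filter only needs
three short functions to hang:

    #include <sys/ksynch.h>

    /* Invented example: the same lock taken on both sides of the API. */
    static kmutex_t squeue_lock;

    static void filter_recv_upcall(void);
    static void proto_xmit(void);

    static void
    proto_recv_path(void)
    {
        mutex_enter(&squeue_lock);      /* protocol holds its lock ...  */
        filter_recv_upcall();           /* ... across the upcall        */
        mutex_exit(&squeue_lock);
    }

    static void
    filter_recv_upcall(void)
    {
        proto_xmit();   /* a splicing filter turns received data around */
    }

    static void
    proto_xmit(void)
    {
        /* Self-deadlock: this thread already owns squeue_lock. */
        mutex_enter(&squeue_lock);
        mutex_exit(&squeue_lock);
    }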

> >   In what context can these calls be made?  Can any of them be made in
> >   interrupt context?
> >   
> Yes they can be, for example when queuing a packet.

OK; document.

> > Section 4.4.4.3:
> >
> >   I don't understand the non-zero return value for su_disconnected.
> >   It says that it's "safe" to release the control block at that
> >   point.  _Must_ the transport behave as though sd_close had been
> >   called at this point, or is it still possible for the upper level to
> >   call sd_close on that same socket?
> >   
> 
> The main reason is SCTP, which needs it when dealing with associations.
> Associations are removed as soon as they are disconnected. For other
> connection-oriented protocols this could be used to get rid of sockets
> that are disconnected before they are pulled off of the accept queue,
> but that would just be an optimization. And yes, if a non-zero value is
> returned, sd_close() would not happen, so the transport would have to
> behave as if sd_close() had been received. But we need to think about
> this a bit more; we might just say that zero will always be returned
> (for version 1).

OK.

> > Section 4.4.4.7:
> >
> >   This isn't symmetric with the read side, and it should be.  On the
> >   read side, there's just a downcall to say "more space is now
> >   available."  The same should be true for the send side.
> >   
> On the read side we return the amount of space left in the recv buffer.
> We are hoping that the protocol will do something intelligent with that
> info, like adjusting its window size, etc.
>
> For send we either accept the whole packet or reject it, so it seems
> unnecessary to return the amount of space left.

I don't think I understand that answer, because I wasn't talking about
byte counts at all.

On the receive (read) side of this interface, there's a downcall that
tells the transport layer "more buffer space is now available."  If
the transport had the window closed or was discarding data or
asserting flow control downwards, it could release that condition
based on that call.

A similar design would work well on the transmit (write) side of the
interface.  It would work well to have the upper level call the
transport to send data until flow control occurs and the transport
returns error.  The transport would then call back later to the upper
level to say "I now have more space."

That would be a symmetric design.  The current design isn't symmetric;
it uses very different mechanisms for flow control on the transmit and
receive sides, and I think that's confusing and hard to use.  Flow
control is hard to get right in general for most implementors, and I
think it'd be better to have symmetry here.
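
In interface terms, the symmetry I'm asking for is just a matched pair
of "space available" notifications (the names below are invented, not
taken from the document):

    /*
     * Receive side (already in the design, under whatever name it uses):
     * the socket layer tells the protocol that the application has
     * drained the receive buffer, so flow control can be released.
     */
    void sd_rcv_space_available(void *proto_handle);    /* downcall */

    /*
     * Transmit side (what I'm asking for): the protocol tells the socket
     * layer that a previously refused send can now be retried.
     */
    void su_snd_space_available(void *sock_handle);     /* upcall */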

> > Section 4.4.4.11:
> >
> >   What does this do?  What are "transient errors," who receives them,
> >   and when?  (I assume this is for handling ICMP errors in UDP ... but
> >   more about the behavior of the interface is needed.)
> >
> >   
> Your guess is correct; it sets so_error in the socket, which is consumed
> by the next operation.

The specific semantics need to be documented.
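
For example, the usual set-once/consume-once so_error contract looks
like this; I'm sketching it from the BSD-style behavior, since the
document doesn't say whether that's what is intended (structure and
function names are mine):

    #include <sys/ksynch.h>

    /* Invented structure; only so_error and its lock matter here. */
    typedef struct sonode_sketch {
        kmutex_t so_lock;
        int so_error;
    } sonode_sketch_t;

    /* Upcall from the protocol: ICMP (or similar) reported an error. */
    static void
    su_set_error(sonode_sketch_t *so, int error)
    {
        mutex_enter(&so->so_lock);
        so->so_error = error;
        mutex_exit(&so->so_lock);
        /* presumably also wakes up sleepers and signals pollers */
    }

    /* Next send/recv/getsockopt(SO_ERROR): report the error once, clear it. */
    static int
    so_consume_error(sonode_sketch_t *so)
    {
        int error;

        mutex_enter(&so->so_lock);
        error = so->so_error;
        so->so_error = 0;
        mutex_exit(&so->so_lock);
        return (error);
    }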

> > Section 8.2.1:
> >
> >   Why does ksock_cb_flags need to exist?
> >   
> So that we can update individual callbacks.

That doesn't quite answer my question.  Why do you need to "update"
individual callbacks?  Why wouldn't callbacks be set exactly once and
that'd be it?  What's the usage case?

Please tell me that we're not planning to play function pointer games
here as well.  :-<
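
For the record, my reading of ksock_cb_flags is that it's a validity
mask over the callback structure, i.e. something like the sketch below
(the field and flag names are guesses), which is exactly the kind of
selective function-pointer swapping I'd like to see a usage case for:

    #include <sys/types.h>
    #include <sys/stream.h>

    /* All names below are guesses, used only to frame the question. */
    typedef struct ksock_callbacks {
        uint32_t ksock_cb_flags;                /* which members are valid */
        void (*ksock_cb_newconn)(void *, void *);
        void (*ksock_cb_data)(void *, mblk_t *);
    } ksock_callbacks_t;

    #define KSOCK_CB_NEWCONN    0x01
    #define KSOCK_CB_DATA       0x02

    /* "Update individual callbacks": only flagged members get (re)installed. */
    static void
    ksock_setcallbacks(ksock_callbacks_t *cur, const ksock_callbacks_t *new)
    {
        if (new->ksock_cb_flags & KSOCK_CB_NEWCONN)
            cur->ksock_cb_newconn = new->ksock_cb_newconn;
        if (new->ksock_cb_flags & KSOCK_CB_DATA)
            cur->ksock_cb_data = new->ksock_cb_data;
    }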

> > Section 8.4:
> >
> >   Is there anything in principle that would prevent this from being
> >   usable to implement AF_UNIX?
> >   
> Only because the AF_UNIX socket implementation is TPI-based; we do not
> support TPI-based kernel sockets.

I'm talking about the AF_UNIX socket semantics -- the behavior -- not
the internal implementation.

OK; it sounds like it _could_ be used, but just isn't right now.

-- 
James Carlson, Solaris Networking              <[EMAIL PROTECTED]>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677