> > > On the other hand trying to hook offloaded iWARP into the normal stack
> > > does seem to lead to a mess. I see DaveM's point: TCP port space is
> > > just the beginning -- filtering, queueing, etc also have config that
> > > ultimately an offload device would want to hook too.
>
> > TCP port space is just the beginning but then these features
> > didn't show up all at once in the kernel either. Instead of
> > evolving iWARP implementation, we can't even take a baby step
> > and fix a flaw that exists in the current kernel. Why are we
> > "replicating" everything offered by the host stack instead of
> > hooking in? It does not sound like good engineering to me.
>
> Well as I said I don't particularly see a clean solution. But the point
> I was making was that the net stack is already very complex with many
> places where interface configs are controlled -- having to add hooks to
> pass that config on to offload devices is going to add even more
> complexity and also add constraints to the format of that config
> information. Which is not good.
>

To my understanding, our discussion touches two topics. One is solving
the TCP port space issue; the other is more general: it is about the
proper integration of offloaded TCP within Linux. So, the second topic
is a generalization of the first.

Regarding the first topic, what I was about to propose is that the iWARP
kernel driver (software iWARP or RNIC) itself should take care of port
space allocations. Port space maintenance functionality should be
minimized at the iWARP CM level.

It looks straightforward to me if, during the rdma_connect() call, the
driver picks a free port using a socket/bind sequence for its local
interface. The same would be possible for the passive connection setup,
which always involves an rdma_bind_addr() - we would have to pass the
rdma_bind_addr() call down to the driver, and EADDRINUSE would be a
reasonable return value.

Things get a little more complicated when it comes to INADDR_ANY and
port 0 bindings. In private email, Bob Sharp already suggested it: the
iWARP CM would have to pick a port and try it on all interfaces, maybe
by starting with a port 0 binding on one interface and then trying to
extend the returned port to all remaining interfaces. That introduces
an unbind() call if things fail, too.

In any case, the rdma_bind_addr() call would create additional state at
driver level. For softiwarp, a TCP socket would be created and bound
during bind() or connect(); for an RNIC driver, (currently) the same
would happen. While with softiwarp this socket would be used for
communication later, the RNIC driver would simply have to keep it around
until the connection endpoint gets destroyed or the port gets unbound.
A new kernel interface to bind a port without having to allocate a
socket is something I would put on the wishlist for netdev.

The more general issue, the proper integration of offloaded TCP with all
the available tools for filtering, queueing, etc. of the kernel TCP
stack, is the harder nut to crack, and we should start discussing it. I
propose to avoid any special treatment of RNIC devices at link or IP
level but, at least for now, to make it visible per connection (only!)
whether a TCP connection is offloaded. A simple socket flag (visible via
netstat etc.)
could serve that purpose.

Architecturally, network interfaces introduced by RNIC hardware should
be able to serve normal L2 connectivity (used by any in-kernel
connection endpoint) and offloaded iWARP connections at the same time,
while sharing TCP port space with the kernel. The major argument for
iWARP is link unification, and it should be extensible to flexible RDMA
enablement at application level. And single-homed hosts with an RNIC
should have plain TCP connectivity...

For now and maybe forever, an offloaded connection would not fulfill the
conditions to serve all the good additional features of an in-kernel
connection. It would be up to the user to explicitly decide whether to
have offloaded connections anyway. Some of the features might get
supported by additional private communication between driver and
offloaded connection - but I would restrict that to supporting
functionality which does not impose any further changes on the kernel
network stack (statistics etc., if possible). All other features would
be known to be unavailable. Of course, a softiwarp connection would be
visible as a normal in-kernel TCP connection.

Maybe that solution is too simple-minded and I am missing some serious
roadblocks. Please let me know. In any case, let's start discussing
these things to come up with a reasonable solution to be further
discussed with the responsible people.

Many thanks,
Bernard.
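
For illustration, the socket/bind port reservation described above can
be sketched in user space. This is only a rough sketch in Python, not
driver code: a kernel driver would do the equivalent with in-kernel
socket calls. The helper names reserve_port(), reserve_port_on_all()
and release_port() are made up for this example and are not part of any
RDMA or kernel API. The sketch assumes IPv4 and plain TCP sockets
without SO_REUSEADDR.

```python
import errno
import socket


def reserve_port(addr="127.0.0.1", port=0):
    """Bind a TCP socket to (addr, port) and return (socket, bound_port).

    port=0 asks the host stack to pick a free port. The port stays
    reserved only while the returned socket is kept open.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((addr, port))
    except OSError:
        sock.close()
        raise
    return sock, sock.getsockname()[1]


def release_port(sock):
    """Hand the port back to the host stack (the 'unbind' step)."""
    sock.close()


def reserve_port_on_all(addrs):
    """Reserve the same port on every address in addrs.

    Starts with a port 0 bind on the first address, then tries to
    extend the returned port to the remaining addresses. If any bind
    fails, everything reserved so far is released and the error is
    re-raised.
    """
    first, port = reserve_port(addrs[0], 0)
    held = [first]
    try:
        for addr in addrs[1:]:
            sock, _ = reserve_port(addr, port)
            held.append(sock)
    except OSError:
        for sock in held:
            release_port(sock)
        raise
    return held, port


if __name__ == "__main__":
    holder, port = reserve_port()
    try:
        # A second owner of the same address and port is refused.
        reserve_port(port=port)
    except OSError as exc:
        print("second bind failed:", errno.errorcode.get(exc.errno))
    release_port(holder)
```

The socket is never used for traffic here; it only holds the port, which
is the role it would play for an RNIC driver. A caller of
reserve_port_on_all() that gets an error could simply retry, since a new
port 0 bind will usually return a different port.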
