> For example, my idea to allow ESTABLISHED TCP socket demux to be done
> before netfilter is flawed.  Connection tracking and NAT can change
> the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
> socket, therefore we must always hit netfilter first.
Hmm, how does this happen?  I can see two cases: either a connection is
masqueraded and an application did a bind() on a local port that is
also used by the masquerading engine (that could be handled by simply
disallowing such binds), or you have a transparent proxy setup with the
proxy on the local host.  Perhaps in the latter case netfilter could be
taught to reinject packets in a way that makes them hit another
ESTABLISHED lookup.  Did I miss a case?

> All the original costs of route, netfilter, TCP socket lookup all
> reappear as we make VJ netchannels fit all the rules of real practical
> systems, eliminating their gains entirely.

At least most of the optimizations of the early demux scheme could
probably be obtained more simply by adding a fast path to
iptables/conntrack/etc. that checks whether all rules only examine SYN
etc. packets and, if so, skips the full rule walk (or, more generally,
a fast TCP flag mask check similar to what TCP itself does).  With
that, ESTABLISHED packets would reach TCP with only relatively small
overhead.

> I will also note in passing that papers on related ideas, such as the
> Exokernel stuff, are very careful to not address the issue of how
> practical 1) their demux engine is and 2) the negative side effects
> of userspace TCP implementations.  For an example of the latter, if
> you have some 1GB JAVA process you do not want to wake that monster
> up just to do some ACK processing or TCP window updates, yet if you
> don't you violate TCP's rules and risk spurious unnecessary
> retransmits.

I don't quite see why the size of the process matters here: if only
some user space TCP library is called directly, it shouldn't really
matter how big or small the rest of the process is.  Or did you mean
the migration costs described below?

But on the other hand full user space TCP seems to me of little gain
compared to a process context implementation.  I somehow like it
better to hide these implementation details in the kernel.
> Furthermore, the VJ netchannel gains can be partially obtained from
> generic stateless facilities that we are going to get anyways.
> Networking chips supporting multiple MSI-X vectors, chosen by hashing
> the flow ID, can move TCP processing to "end nodes" which are cpu
> threads in this case, by having each such MSI-X vector target a
> different cpu thread.

The problem with that scheme is that, to do process context processing
effectively, you would need to teach the scheduler to aggressively
migrate processes on wakeup, so that the process ends up on the CPU
that was selected by the hash function in the NIC.  But what do you do
when you have lots of different connections with different target CPU
hash values, or when this would require moving multiple compute
intensive processes onto a single core?

Without user context TCP, but using softirqs instead, it looks a bit
better because you can at least use different CPUs to do the ACK
processing etc., and the hash function spreading connections over your
CPUs doesn't hurt.  But you still have relatively high cache line
transfer costs in handing over these packets from the softirq CPUs to
the final process consumer.

I liked VJ's idea of using arrays-of-something instead of lists for
that to avoid some of the cache line transfers.  At least it sounds
nice in theory; I haven't seen any hard numbers comparing this scheme
to a traditional doubly linked list.

-Andi