Andrew Gallatin wrote:
> rajagopal kunhappan wrote:
>> Andrew Gallatin wrote:
>>>>>
>>>>> FWIW, my tx path is very "lean and mean". The only time
>>>>> locks are held are when writing the tx descriptors to the NIC,
>>>>> and when allocating a dma handle from a pre-allocated per-ring pool.
>>>>>
>>>>> I thought the serializer was silly too, but PAE claimed a speedup
>>>>> from it. I think that PAE claimed the speedup came from
>>>>> never back-pressuring the stack when the host overran the
>>>>> NIC. One of the "features" of the serializer was to always
>>>>> block the calling thread if the tx queue was exhausted.
>>>>>
>>>>> Have you done any packets-per-second benchmarks with your
>>>>> fanout code? I'm concerned that it's very cache unfriendly
>>>>
>>>> With nxge we get line rate with MTU-sized packets with 8 Tx rings.
>>>> The numbers are similar to what they were with the nxge serializer
>>>> in place.
>>>>
>>>>> if you have a long run of packets all going to the same
>>>>> destination. This is because you walk the mblk chain, reading
>>>>> the packet headers, and queue up a big chain. If the chain
>>>>> gets too long, the mblk and/or the packet headers will be
>>>>> pushed out of cache by the time they make it to the driver's
>>>>> xmit routine. So in this case you could have twice as many
>>>>> cache misses as normal when things get really backed up.
>>>>
>>>> We would like to have the drivers operate in non-serialized mode.
>>>> But if, for whatever reason, you want to use serialized mode and
>>>> there are issues, we can look into that.
>>>
>>> The only reason I care about the serializer is the pre-crossbow
>>> feedback from PAE that the original serializer avoided
>>> putting backpressure on the stack when the TX rings fill up.
>>
>> Yes, pre-crossbow, putting back pressure would mean queueing up
>> packets in DLD. Thus all packets get queued in DLD until the driver
>> relieves the flow control. By then thousands of packets would be
>> sitting in DLD (this is because TCP does not check for the STREAMS
>> QFULL condition on the DLD write queue and keeps on sending packets).
>> After flow control is relieved, the queued-up packets are drained by
>> dld_wsrv() (in single-threaded mode). A single thread is no good on a
>> 10gig link, and this caused performance issues.
>
> And crossbow addresses this?
Yes. Each Tx ring has its own queue (a soft ring). If a Tx ring is flow
controlled, packets get backed up in the soft ring associated with that
Tx ring, so the other Tx rings can continue to send out traffic (a
conceptual sketch of this per-ring queueing follows below, after the
message).

>>> I'm happy using the normal fanout (with some caveats below) as
>>> long as PAE doesn't complain about it later.
>>>
>>> The caveats being that I want a fanout mode that uses a
>>> standard Toeplitz hash so as to maintain CPU locality.
>>
>> I am curious as to how you maintain CPU locality for Tx traffic. Can
>> you give some details?
>>
>> On the Solaris stack, if you have a bunch of, say, TCP connections
>> sending traffic, they can come from any CPU on the system. By this I
>> mean that which CPU an application runs on is completely random
>> unless you do CPU binding.
>>
>> I can see tying Rx traffic to a specific Rx ring and CPU. If it is a
>> forwarding case, then one can tie an Rx ring to a Tx ring.
>
> On OSes other than Windows, this helps mainly on the TCP receive side,
> in that acks will flow out the same CPU that handled the receive
> (assuming a direct dispatch from the ISR through to the TCP/IP stack).
>
> AFAIK, only Windows can really control affinity to a fine level, since
> they require a Toeplitz hash, and you must provide hooks to use their
> "key" and to update your indirection table. This means they can
> control affinities for connections (or at least sets of connections)
> and update them on the fly to match the application's affinity.
> According to our Windows guy, they really use this stuff.
>
> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the
> Windows solution is inferior?) that crossbow does not try to take
> the Windows approach? Essentially all NICs available today that support
> multiple RX queues also support all this other stuff that
> Windows requires. Why not take advantage of it?

We take advantage of this, though not through the Toeplitz hash. There
are some things that are still missing, like being able to retarget an
MSI-X interrupt to a CPU of our choice; work is underway to add APIs for
this. Once we have that, we can have the poll thread run on the same CPU
as the MSI-X interrupt associated with an Rx ring. We can further align
the other threads that take part in processing the incoming Rx traffic
to CPUs that belong to the same socket (same socket meaning CPUs sharing
a common L2 cache).

-krgopi
--
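To make the per-ring backpressure concrete, here is a minimal C sketch of
the idea described at the top of the reply above. It is only a sketch, not
the Crossbow soft-ring implementation: every name in it (pkt_t,
tx_soft_ring_t, hw_ring_send(), tx_fanout_send(), tx_ring_unblock()) is
hypothetical, the ring count and queue depth are invented, and locking is
omitted.

#include <stdint.h>
#include <stdbool.h>

/*
 * Conceptual sketch only -- not the Crossbow soft-ring code.  All of
 * the names here (pkt_t, tx_soft_ring_t, hw_ring_send(), ...) are
 * hypothetical, and locking is omitted for brevity.
 */
#define NTXRINGS    8           /* e.g. the 8 nxge Tx rings mentioned above */
#define SOFTQ_DEPTH 1024

typedef struct pkt pkt_t;       /* stand-in for an mblk chain */

/* Assumed driver entry point: returns false when the HW ring is full. */
extern bool hw_ring_send(int ring, pkt_t *pkt);

typedef struct tx_soft_ring {
    pkt_t *q[SOFTQ_DEPTH];      /* this ring's private backlog */
    int    head, tail, count;
    bool   flow_blocked;        /* set while the HW ring is full */
} tx_soft_ring_t;

static tx_soft_ring_t tx_rings[NTXRINGS];

/*
 * Send one packet, fanned out by a connection hash.  If the chosen ring
 * is flow controlled, the packet backs up on that ring's soft ring only;
 * traffic hashed to the other rings keeps flowing, and nothing piles up
 * in a single shared DLD queue.
 */
static bool
tx_fanout_send(uint32_t conn_hash, pkt_t *pkt)
{
    int ring = conn_hash % NTXRINGS;
    tx_soft_ring_t *sr = &tx_rings[ring];

    if (!sr->flow_blocked && hw_ring_send(ring, pkt))
        return (true);                  /* went straight to hardware */

    sr->flow_blocked = true;
    if (sr->count == SOFTQ_DEPTH)
        return (false);                 /* backlog full: push back further */
    sr->q[sr->tail] = pkt;
    sr->tail = (sr->tail + 1) % SOFTQ_DEPTH;
    sr->count++;
    return (true);                      /* queued on this ring only */
}

/* Tx-completion callback: the ring has free descriptors again, so drain. */
static void
tx_ring_unblock(int ring)
{
    tx_soft_ring_t *sr = &tx_rings[ring];

    while (sr->count > 0 && hw_ring_send(ring, sr->q[sr->head])) {
        sr->head = (sr->head + 1) % SOFTQ_DEPTH;
        sr->count--;
    }
    if (sr->count == 0)
        sr->flow_blocked = false;
}

The pre-crossbow behaviour described earlier in the thread corresponds to
collapsing the per-ring queues into one shared queue drained by a single
thread; the per-ring layout is what lets one busy connection stall only
the ring it hashed to.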

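For reference, the Toeplitz hash Andrew describes is straightforward to
compute in software. Below is a minimal C sketch, assuming the 40-byte RSS
secret key and an IPv4/TCP input tuple (source address, destination
address, source port, destination port, in network byte order);
rss_select_ring() is a hypothetical helper showing how the masked hash
indexes the indirection table that the OS rewrites to steer connections.

#include <stdint.h>
#include <stddef.h>

/*
 * Toeplitz hash over 'len' input bytes using the RSS secret key.
 * For an IPv4/TCP 4-tuple the input is src addr, dst addr, src port,
 * dst port in network byte order (12 bytes), so the 40-byte key is
 * more than long enough (len + 4 key bytes are consumed).
 */
static uint32_t
toeplitz_hash(const uint8_t key[40], const uint8_t *input, size_t len)
{
    /* 'window' holds key bits [n, n+31] while processing input bit n. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
        ((uint32_t)key[2] << 8) | key[3];
    uint32_t hash = 0;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (input[i] & (1u << bit))
                hash ^= window;
            /* Slide the 32-bit key window left by one bit. */
            window <<= 1;
            if (key[i + 4] & (1u << bit))
                window |= 1;
        }
    }
    return (hash);
}

/*
 * Ring selection: the low bits of the hash index an indirection table
 * (typically 128 entries) that the OS can rewrite on the fly to follow
 * the application's CPU affinity.
 */
static int
rss_select_ring(uint32_t hash, const uint8_t indirection_table[128])
{
    return (indirection_table[hash & 0x7f]);
}

Because the result depends only on the key and the tuple, the host stack
and the NIC compute the same value for a given connection, so affinity
can be changed on the fly by rewriting a table entry rather than by
reprogramming the NIC's hashing logic.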