Andrew Gallatin wrote:
>>>
>>> FWIW, my tx path is very "lean and mean". The only time
>>> locks are held are when writing the tx descriptors to the NIC,
>>> and when allocating a dma handle from a pre-allocated per-ring pool.
>>>
>>> I thought the serializer was silly too, but PAE claimed a speedup
>>> from it. I think that PAE claimed the speedup came from
>>> never back-pressuring the stack when the host overran the
>>> NIC. One of the "features" of the serializer was to always
>>> block the calling thread if the tx queue was exhausted.
>>>
>>> Have you done any packets-per-second benchmarks with your
>>> fanout code? I'm concerned that it's very cache unfriendly
>>
>> With nxge we get line rate with MTU sized packets with 8 Tx rings. The
>> numbers are similar to what they were with the nxge serializer in place.
>>
>>> if you have a long run of packets all going to the same
>>> destination. This is because you walk the mblk chain, reading
>>> the packet headers, and queue up a big chain. If the chain
>>> gets too long, the mblk and/or the packet headers will be
>>> pushed out of cache by the time they make it to the driver's
>>> xmit routine. So in this case you could have twice as many
>>> cache misses as normal when things get really backed up.
>>
>> We would like to have the drivers operate in non-serialized mode. But
>> if for whatever reason you want to use serialized mode and there are
>> issues, we can look into that.
>
> The only reason I care about the serializer is the pre-crossbow
> feedback from PAE that the original serializer avoided
> putting backpressure on the stack when the TX rings fill up.
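
The fanout pattern being debated above looks roughly like the sketch below:
walk the chain, hash each packet's headers to pick a Tx ring, and build
per-ring sub-chains before calling into the driver. This is only an
illustrative sketch; pkt_t, pkt_hash_headers() and ring_append() are
made-up stand-ins, not the real mblk_t/MAC-layer interfaces. The point is
that every packet's headers are read once at fanout time and again in the
driver's xmit routine, which is where the cache-miss concern comes from.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical, simplified packet chain element (stand-in for mblk_t). */
    typedef struct pkt {
            struct pkt      *p_next;        /* next packet in the chain */
            uint8_t         *p_data;        /* start of the packet headers */
            size_t          p_len;          /* length of this packet */
    } pkt_t;

    /* Hash of the packet headers, e.g. a Toeplitz hash over the 4-tuple. */
    extern uint32_t pkt_hash_headers(const pkt_t *);
    /* Append a packet to the sub-chain destined for the given Tx ring. */
    extern void     ring_append(int ring, pkt_t *);

    static void
    tx_fanout(pkt_t *chain, int nrings)
    {
            pkt_t   *p, *next;

            for (p = chain; p != NULL; p = next) {
                    next = p->p_next;
                    p->p_next = NULL;
                    /*
                     * Each hash reads the packet headers; on a very long
                     * chain those cache lines can be evicted again before
                     * the driver's xmit routine re-reads them.
                     */
                    ring_append((int)(pkt_hash_headers(p) % (uint32_t)nrings), p);
            }
    }
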
Yes, pre-crossbow, putting back pressure would mean queueing up packets in
DLD. All packets get queued in DLD until the driver relieves the flow
control; by then thousands of packets would be sitting in DLD (this is
because TCP does not check for the STREAMS QFULL condition on the DLD write
queue and keeps on sending packets). After flow control is relieved, the
queued-up packets are drained by dld_wsrv() in single-threaded mode. A
single thread is no good on a 10gig link, and this caused performance
issues.

> I'm happy using the normal fanout (with some caveats below) as
> long as PAE doesn't complain about it later.
>
> The caveats being that I want a fanout mode that uses a
> standard Toeplitz hash so as to maintain CPU locality.

I am curious as to how you maintain CPU locality for Tx traffic. Can you
give some details? On the Solaris stack, if you have a bunch of, say, TCP
connections sending traffic, they can come from any CPU on the system. By
this I mean that which CPU an application runs on is completely random
unless you do CPU binding. I can see tying Rx traffic to a specific Rx ring
and CPU. If it is a forwarding case, then one can tie an Rx ring to a Tx
ring.

Thanks,
-krgopi

> Or a hook so I can implement my own tx side hashing.
>
> Drew
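
The "standard Toeplitz hash" Drew asks for above is the one specified for
Microsoft RSS: the input (typically the connection's addresses and ports)
is treated as a bit string and slid across a secret key, XOR-ing a 32-bit
window of the key into the result for every 1 bit. A minimal sketch in
plain C follows; the key layout and tuple packing here are assumptions for
illustration, not taken from any particular driver.

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Standard (Microsoft RSS) Toeplitz hash.  'key' must be at least
     * len + 4 bytes long; the usual RSS key is 40 bytes, which covers
     * the 12-byte IPv4 4-tuple input with room to spare.
     */
    static uint32_t
    toeplitz_hash(const uint8_t *key, const uint8_t *input, size_t len)
    {
            /* Seed the 32-bit window with the first four key bytes. */
            uint32_t window = ((uint32_t)key[0] << 24) |
                ((uint32_t)key[1] << 16) | ((uint32_t)key[2] << 8) | key[3];
            uint32_t result = 0;
            size_t i;
            int bit;

            for (i = 0; i < len; i++) {
                    for (bit = 7; bit >= 0; bit--) {
                            if (input[i] & (1u << bit))
                                    result ^= window;
                            /* Slide the key window left by one bit. */
                            window = (window << 1) |
                                ((key[i + 4] >> bit) & 1u);
                    }
            }
            return (result);
    }

    /*
     * Hypothetical use: pick a Tx ring for an IPv4/TCP packet whose
     * addresses and ports have been packed into 'tuple' in RSS order
     * (src addr, dst addr, src port, dst port).
     */
    static int
    select_tx_ring(const uint8_t key[40], const uint8_t tuple[12], int nrings)
    {
            return ((int)(toeplitz_hash(key, tuple, 12) % (uint32_t)nrings));
    }

Note that a received packet presents the source/destination fields swapped
relative to the transmit side, so keeping both directions of a connection
on the same CPU generally requires either swapping the fields on one side
before hashing or using a symmetric key.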
